Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.
As part of this homework, I decided to use the [Kaggle dataset](https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification).
The dataset contains 58,592 observations on 44 variables. As part of this exercise, I will analyze the data and predict whether a policy results in an insurance claim. I will approach this classification problem using a Decision Tree, a Random Forest, and SVM, then evaluate and compare the performance of all three models.
Load the required libraries
library(stats)
library(corrplot)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(caret)
library(rpart.plot)
library(DataExplorer)
library(skimr)
library(performanceEstimation)
library(randomForest)
library(e1071)
# col_types string: c = character, n = number, f = factor
trainData <- read_csv("train.csv", col_types = "cnnnfnffffffffffffffffffffffffffffffffffffff")
dim(trainData)
## [1] 58592 44
Let’s analyze the distribution of the various features and their values.
# Snippet of the data
head(trainData)
## # A tibble: 6 x 44
## policy_id policy_tenure age_of_car age_of_policyholder area_cluster
## <chr> <dbl> <dbl> <dbl> <fct>
## 1 ID00001 0.516 0.05 0.644 C1
## 2 ID00002 0.673 0.02 0.375 C2
## 3 ID00003 0.841 0.02 0.385 C3
## 4 ID00004 0.900 0.11 0.433 C4
## 5 ID00005 0.596 0.11 0.635 C5
## 6 ID00006 1.02 0.07 0.519 C6
## # ... with 39 more variables: population_density <dbl>, make <fct>,
## # segment <fct>, model <fct>, fuel_type <fct>, max_torque <fct>,
## # max_power <fct>, engine_type <fct>, airbags <fct>, is_esc <fct>,
## # is_adjustable_steering <fct>, is_tpms <fct>, is_parking_sensors <fct>,
## # is_parking_camera <fct>, rear_brakes_type <fct>, displacement <fct>,
## # cylinder <fct>, transmission_type <fct>, gear_box <fct>,
## # steering_type <fct>, turning_radius <fct>, length <fct>, width <fct>, ...
# glimpse and summary of the data
glimpse(trainData)
## Rows: 58,592
## Columns: 44
## $ policy_id <chr> "ID00001", "ID00002", "ID00003", "ID0~
## $ policy_tenure <dbl> 0.51587359, 0.67261851, 0.84111026, 0~
## $ age_of_car <dbl> 0.05, 0.02, 0.02, 0.11, 0.11, 0.07, 0~
## $ age_of_policyholder <dbl> 0.6442308, 0.3750000, 0.3846154, 0.43~
## $ area_cluster <fct> C1, C2, C3, C4, C5, C6, C7, C8, C7, C~
## $ population_density <dbl> 4990, 27003, 4076, 21622, 34738, 1305~
## $ make <fct> 1, 1, 1, 1, 2, 3, 4, 1, 3, 1, 1, 1, 1~
## $ segment <fct> A, A, A, C1, A, C2, B2, B2, C2, B2, A~
## $ model <fct> M1, M1, M1, M2, M3, M4, M5, M6, M4, M~
## $ fuel_type <fct> CNG, CNG, CNG, Petrol, Petrol, Diesel~
## $ max_torque <fct> 60Nm@3500rpm, 60Nm@3500rpm, 60Nm@3500~
## $ max_power <fct> 40.36bhp@6000rpm, 40.36bhp@6000rpm, 4~
## $ engine_type <fct> F8D Petrol Engine, F8D Petrol Engine,~
## $ airbags <fct> 2, 2, 2, 2, 2, 6, 2, 2, 6, 6, 2, 2, 2~
## $ is_esc <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_adjustable_steering <fct> No, No, No, Yes, No, Yes, Yes, Yes, Y~
## $ is_tpms <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_parking_sensors <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes~
## $ is_parking_camera <fct> No, No, No, Yes, Yes, Yes, No, No, Ye~
## $ rear_brakes_type <fct> Drum, Drum, Drum, Drum, Drum, Disc, D~
## $ displacement <fct> 796, 796, 796, 1197, 999, 1493, 1497,~
## $ cylinder <fct> 3, 3, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4~
## $ transmission_type <fct> Manual, Manual, Manual, Automatic, Au~
## $ gear_box <fct> 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5~
## $ steering_type <fct> Power, Power, Power, Electric, Electr~
## $ turning_radius <fct> 4.6, 4.6, 4.6, 4.8, 5, 5.2, 5, 4.8, 5~
## $ length <fct> 3445, 3445, 3445, 3995, 3731, 4300, 3~
## $ width <fct> 1515, 1515, 1515, 1735, 1579, 1790, 1~
## $ height <fct> 1475, 1475, 1475, 1515, 1490, 1635, 1~
## $ gross_weight <fct> 1185, 1185, 1185, 1335, 1155, 1720, 1~
## $ is_front_fog_lights <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_rear_window_wiper <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_washer <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_defogger <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_brake_assist <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_power_door_locks <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_central_locking <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_power_steering <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ is_driver_seat_height_adjustable <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_day_night_rear_view_mirror <fct> No, No, No, Yes, Yes, No, No, Yes, No~
## $ is_ecw <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_speed_alert <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ ncap_rating <fct> 0, 0, 0, 2, 2, 3, 5, 2, 3, 0, 0, 2, 2~
## $ is_claim <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
summary(trainData)
## policy_id policy_tenure age_of_car age_of_policyholder
## Length:58592 Min. :0.002735 Min. :0.00000 Min. :0.2885
## Class :character 1st Qu.:0.210250 1st Qu.:0.02000 1st Qu.:0.3654
## Mode :character Median :0.573792 Median :0.06000 Median :0.4519
## Mean :0.611246 Mean :0.06942 Mean :0.4694
## 3rd Qu.:1.039104 3rd Qu.:0.11000 3rd Qu.:0.5481
## Max. :1.396641 Max. :1.00000 Max. :1.0000
##
## area_cluster population_density make segment model
## C8 :13654 Min. : 290 1:38126 A :17321 M1 :14948
## C2 : 7342 1st Qu.: 6112 2: 2373 C1 : 3557 M4 :14018
## C5 : 6979 Median : 8794 3:14018 C2 :14018 M6 :13776
## C3 : 6101 Mean :18827 4: 1961 B2 :18314 M8 : 4173
## C14 : 3660 3rd Qu.:27003 5: 2114 B1 : 4173 M7 : 2940
## C13 : 3423 Max. :73430 Utility: 1209 M3 : 2373
## (Other):17433 (Other): 6364
## fuel_type max_torque max_power
## CNG :20330 113Nm@4400rpm :17796 88.50bhp@6000rpm :17796
## Petrol:20532 60Nm@3500rpm :14948 40.36bhp@6000rpm :14948
## Diesel:17730 250Nm@2750rpm :14018 113.45bhp@4000rpm:14018
## 82.1Nm@3400rpm: 4173 55.92bhp@5300rpm : 4173
## 91Nm@4250rpm : 2373 67.06bhp@5500rpm : 2373
## 200Nm@1750rpm : 2114 97.89bhp@3600rpm : 2114
## (Other) : 3170 (Other) : 3170
## engine_type airbags is_esc is_adjustable_steering
## F8D Petrol Engine :14948 2:40425 No :40191 No :23066
## 1.5 L U2 CRDi :14018 6:16958 Yes:18401 Yes:35526
## K Series Dual jet :13776 1: 1209
## K10C : 4173
## 1.2 L K Series Engine: 2940
## 1.0 SCe : 2373
## (Other) : 6364
## is_tpms is_parking_sensors is_parking_camera rear_brakes_type
## No :44574 Yes:56219 No :35704 Drum:44574
## Yes:14018 No : 2373 Yes:22888 Disc:14018
##
##
##
##
##
## displacement cylinder transmission_type gear_box steering_type
## 1197 :17796 3:21857 Manual :38181 5:44211 Power :33502
## 796 :14948 4:36735 Automatic:20411 6:14381 Electric:23881
## 1493 :14018 Manual : 1209
## 998 : 4173
## 999 : 2373
## 1498 : 2114
## (Other): 3170
## turning_radius length width height
## 4.6 :14948 3445 :14948 1515 :14948 1475 :14948
## 4.8 :14856 4300 :14018 1735 :14856 1635 :14018
## 5.2 :14018 3845 :13776 1790 :14018 1530 :13776
## 4.7 : 4173 3990 : 4538 1620 : 4173 1675 : 4173
## 5 : 3971 3655 : 4173 1745 : 2940 1500 : 2940
## 4.85 : 2940 3995 : 3194 1579 : 2373 1490 : 2373
## (Other): 3686 (Other): 3945 (Other): 5284 (Other): 6364
## gross_weight is_front_fog_lights is_rear_window_wiper is_rear_window_washer
## 1185 :14948 No :24664 No :41634 No :41634
## 1335 :14856 Yes:33928 Yes:16958 Yes:16958
## 1720 :14018
## 1340 : 4173
## 1410 : 2940
## 1155 : 2373
## (Other): 5284
## is_rear_window_defogger is_brake_assist is_power_door_locks is_central_locking
## No :38077 No :26415 No :16157 No :16157
## Yes:20515 Yes:32177 Yes:42435 Yes:42435
##
##
##
##
##
## is_power_steering is_driver_seat_height_adjustable
## Yes:57383 No :24301
## No : 1209 Yes:34291
##
##
##
##
##
## is_day_night_rear_view_mirror is_ecw is_speed_alert ncap_rating is_claim
## No :36309 No :16157 Yes:58229 0:19097 0:54844
## Yes:22283 Yes:42435 No : 363 2:21402 1: 3748
## 3:14018
## 5: 1961
## 4: 2114
##
##
plot_missing(trainData)
The missingness plot confirms that there is no missing data in this dataset.
Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(trainData)
| Name | trainData |
| Number of rows | 58592 |
| Number of columns | 44 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 39 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| policy_id | 0 | 1 | 7 | 7 | 0 | 58592 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| area_cluster | 0 | 1 | FALSE | 22 | C8: 13654, C2: 7342, C5: 6979, C3: 6101 |
| make | 0 | 1 | FALSE | 5 | 1: 38126, 3: 14018, 2: 2373, 5: 2114 |
| segment | 0 | 1 | FALSE | 6 | B2: 18314, A: 17321, C2: 14018, B1: 4173 |
| model | 0 | 1 | FALSE | 11 | M1: 14948, M4: 14018, M6: 13776, M8: 4173 |
| fuel_type | 0 | 1 | FALSE | 3 | Pet: 20532, CNG: 20330, Die: 17730 |
| max_torque | 0 | 1 | FALSE | 9 | 113: 17796, 60N: 14948, 250: 14018, 82.: 4173 |
| max_power | 0 | 1 | FALSE | 9 | 88.: 17796, 40.: 14948, 113: 14018, 55.: 4173 |
| engine_type | 0 | 1 | FALSE | 11 | F8D: 14948, 1.5: 14018, K S: 13776, K10: 4173 |
| airbags | 0 | 1 | FALSE | 3 | 2: 40425, 6: 16958, 1: 1209 |
| is_esc | 0 | 1 | FALSE | 2 | No: 40191, Yes: 18401 |
| is_adjustable_steering | 0 | 1 | FALSE | 2 | Yes: 35526, No: 23066 |
| is_tpms | 0 | 1 | FALSE | 2 | No: 44574, Yes: 14018 |
| is_parking_sensors | 0 | 1 | FALSE | 2 | Yes: 56219, No: 2373 |
| is_parking_camera | 0 | 1 | FALSE | 2 | No: 35704, Yes: 22888 |
| rear_brakes_type | 0 | 1 | FALSE | 2 | Dru: 44574, Dis: 14018 |
| displacement | 0 | 1 | FALSE | 9 | 119: 17796, 796: 14948, 149: 14018, 998: 4173 |
| cylinder | 0 | 1 | FALSE | 2 | 4: 36735, 3: 21857 |
| transmission_type | 0 | 1 | FALSE | 2 | Man: 38181, Aut: 20411 |
| gear_box | 0 | 1 | FALSE | 2 | 5: 44211, 6: 14381 |
| steering_type | 0 | 1 | FALSE | 3 | Pow: 33502, Ele: 23881, Man: 1209 |
| turning_radius | 0 | 1 | FALSE | 9 | 4.6: 14948, 4.8: 14856, 5.2: 14018, 4.7: 4173 |
| length | 0 | 1 | FALSE | 9 | 344: 14948, 430: 14018, 384: 13776, 399: 4538 |
| width | 0 | 1 | FALSE | 10 | 151: 14948, 173: 14856, 179: 14018, 162: 4173 |
| height | 0 | 1 | FALSE | 11 | 147: 14948, 163: 14018, 153: 13776, 167: 4173 |
| gross_weight | 0 | 1 | FALSE | 10 | 118: 14948, 133: 14856, 172: 14018, 134: 4173 |
| is_front_fog_lights | 0 | 1 | FALSE | 2 | Yes: 33928, No: 24664 |
| is_rear_window_wiper | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_washer | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_defogger | 0 | 1 | FALSE | 2 | No: 38077, Yes: 20515 |
| is_brake_assist | 0 | 1 | FALSE | 2 | Yes: 32177, No: 26415 |
| is_power_door_locks | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_central_locking | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_power_steering | 0 | 1 | FALSE | 2 | Yes: 57383, No: 1209 |
| is_driver_seat_height_adjustable | 0 | 1 | FALSE | 2 | Yes: 34291, No: 24301 |
| is_day_night_rear_view_mirror | 0 | 1 | FALSE | 2 | No: 36309, Yes: 22283 |
| is_ecw | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_speed_alert | 0 | 1 | FALSE | 2 | Yes: 58229, No: 363 |
| ncap_rating | 0 | 1 | FALSE | 5 | 2: 21402, 0: 19097, 3: 14018, 4: 2114 |
| is_claim | 0 | 1 | FALSE | 2 | 0: 54844, 1: 3748 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| policy_tenure | 0 | 1 | 0.61 | 0.41 | 0.00 | 0.21 | 0.57 | 1.04 | 1.4 | ▇▅▃▆▃ |
| age_of_car | 0 | 1 | 0.07 | 0.06 | 0.00 | 0.02 | 0.06 | 0.11 | 1.0 | ▇▁▁▁▁ |
| age_of_policyholder | 0 | 1 | 0.47 | 0.12 | 0.29 | 0.37 | 0.45 | 0.55 | 1.0 | ▇▇▃▁▁ |
| population_density | 0 | 1 | 18826.86 | 17660.17 | 290.00 | 6112.00 | 8794.00 | 27003.00 | 73430.0 | ▇▃▂▁▁ |
Next, we compare each continuous feature against the target using boxplots:
trainData %>%
  ggplot(aes(x = policy_tenure, fill = is_claim)) +
  geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_car, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_policyholder, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = population_density, fill = is_claim)) +
geom_boxplot()
Distribution of the data across the categorical variables
long_df1 <- trainData %>% select(c('is_speed_alert','ncap_rating','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df1, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df2 <- trainData %>% select(c('segment','area_cluster','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df2, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) + ggtitle('Comparing Categorical Features and Target')
long_df3 <- trainData %>% select(c('model','is_parking_camera','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df3, aes(x=value, fill=is_claim)) +
  geom_bar() +
  facet_wrap(~kpi, scales="free_x") +
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
  ggtitle('Comparing Categorical Features and Target')
We are considering only the major categorical variables for this analysis.
Based on this exploration, only a subset of the features appears relevant, so we keep just those:
trainData2 <- trainData %>%
  select(policy_tenure, age_of_car, age_of_policyholder, area_cluster, population_density,
         make, segment, model, max_torque, is_parking_sensors, is_parking_camera,
         is_brake_assist, is_power_door_locks, is_central_locking, is_speed_alert,
         ncap_rating, is_claim)
Target Variable Distribution
Let’s visualize our target distribution.
# Bar plot for target (insurance claim)
ggplot(trainData, aes(x=is_claim, fill=is_claim)) +
geom_bar() +
xlab("Insurance Claim") +
ylab("Count") +
ggtitle("Analysis of Insurance Claim") +
scale_fill_discrete(name = "Insurance Claim", labels = c("Absence", "Presence"))
round(prop.table(table(select(trainData, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
Our analysis shows a clear class imbalance in this dataset. Imbalanced data can be quite problematic when classifying the minority class, so I will oversample the training data before modeling to give the models a better chance at identifying the minority class.
One way to address this imbalance is the Synthetic Minority Oversampling Technique, often abbreviated SMOTE. This technique creates a new dataset by oversampling observations from the minority class, which produces a dataset with more balanced classes.
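Concretely, for each minority-class observation $x_i$, SMOTE picks one of its $k$ nearest minority-class neighbours $x_{nn}$ and interpolates a synthetic point between them (this is the standard formulation of the technique, shown here for reference):

$$x_{new} = x_i + \lambda\,(x_{nn} - x_i), \qquad \lambda \sim \mathrm{Uniform}(0, 1).$$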
Our first step is to separate the data into training and test datasets, so that we can later measure accuracy on the holdout test set. We perform an 80/20 train/test split.
# Splitting the data 80/20
set.seed(1234)
training.samples <- trainData2$is_claim %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- trainData2[training.samples, ]
test.data  <- trainData2[-training.samples, ]  # subset trainData2 (not trainData) so train and test share the same columns
round(prop.table(table(select(train.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
round(prop.table(table(select(test.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
We can see that the target variable is distributed identically in the training and test sets, in line with the original data.
Since the target variable is imbalanced, we will now address this by oversampling the minority class.
set.seed(12345)
# perc.over = 1: generate one synthetic example per original minority-class case
train.balanced <- smote(is_claim ~ ., data = train.data, perc.over = 1)
train.balanced %>% ggplot(aes(is_claim)) +
geom_bar(fill = "#04354F") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
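As a quick numeric check, mirroring the proportion tables above, we can confirm that the two classes are now balanced in the SMOTE'd training set:
round(prop.table(table(train.balanced$is_claim)), 2)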
We start by training a decision tree on the balanced data using the selected features:
DT_modelSub1 <- rpart(is_claim ~ age_of_car + age_of_policyholder + population_density +
                        policy_tenure + is_parking_sensors + is_power_door_locks +
                        ncap_rating + is_brake_assist,
                      data = train.balanced, method = 'class')
rpart.plot(DT_modelSub1)
predSub1 <-predict(DT_modelSub1, test.data, type="class")
dtSub1CM <- confusionMatrix(data = predSub1, reference = test.data$is_claim, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5565 226
## 1 5403 523
##
## Accuracy : 0.5196
## 95% CI : (0.5105, 0.5287)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0487
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.69826
## Specificity : 0.50739
## Pos Pred Value : 0.08826
## Neg Pred Value : 0.96097
## Prevalence : 0.06392
## Detection Rate : 0.04464
## Detection Prevalence : 0.50576
## Balanced Accuracy : 0.60282
##
## 'Positive' Class : 1
##
Measure the performance
We can see that performance improves when we consider only the few relevant features, and the decision tree model now includes one additional parameter, is_brake_assist.
The accuracy is now 51.96%, the sensitivity 69.83%, and the specificity 50.74%.
Next, we fit a random forest on the same features:
set.seed(123)
fit.forest <- randomForest(is_claim ~ age_of_car + age_of_policyholder + population_density +
                             policy_tenure + is_parking_sensors + is_power_door_locks +
                             ncap_rating + is_brake_assist,
                           data = train.balanced, importance = TRUE, ntree = 200)
# display model details
fit.forest
##
## Call:
## randomForest(formula = is_claim ~ age_of_car + age_of_policyholder + population_density + policy_tenure + is_parking_sensors + is_power_door_locks + ncap_rating + is_brake_assist, data = train.balanced, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.9%
## Confusion matrix:
## 0 1 class.error
## 0 3468 2530 0.4218073
## 1 1297 4701 0.2162387
rf.pred <- predict(fit.forest, newdata=test.data, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, test.data$is_claim))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6258 280
## 1 4710 469
##
## Accuracy : 0.5741
## 95% CI : (0.5651, 0.5831)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0524
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.57057
## Specificity : 0.62617
## Pos Pred Value : 0.95717
## Neg Pred Value : 0.09056
## Prevalence : 0.93608
## Detection Rate : 0.53410
## Detection Prevalence : 0.55799
## Balanced Accuracy : 0.59837
##
## 'Positive' Class : 0
##
The random forest model's accuracy is 57.41%, with sensitivity 57.06% and specificity 62.62%, which is better than the decision tree. Note that confusionMatrix was called here without positive = "1", so the positive class defaulted to "0"; the sensitivity and specificity above are therefore reported with respect to the majority (no-claim) class, unlike the decision tree's figures.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points.
Many possible hyperplanes could separate the two classes. Our objective is to find the plane with the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
Support vectors are the data points closest to the hyperplane; they determine its position and orientation. Using these support vectors, we maximize the margin of the classifier; deleting a support vector would change the position of the hyperplane.
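For reference, the standard hard-margin formulation of this idea (general SVM theory, not specific to our data): for training points $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, the separating hyperplane is $w \cdot x + b = 0$ and its margin is $2 / \lVert w \rVert$, so maximizing the margin is equivalent to solving

$$\min_{w,\, b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i.$$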
SVM algorithms use a family of mathematical functions known as kernels. A kernel takes the data as input and transforms it into the required form; different SVM variants use different kernel functions.
In essence, the kernel determines the style of SVM used to classify the data. Here we will apply different kernel functions and then verify which one gives the most accurate result.
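For reference, the four kernels we apply below are, in e1071's parameterization (with hyperparameters $\gamma$, $c_0$, and degree $d$):

$$\begin{aligned}
\text{linear:} \quad & K(u, v) = u^\top v \\
\text{polynomial:} \quad & K(u, v) = (\gamma\, u^\top v + c_0)^d \\
\text{radial:} \quad & K(u, v) = \exp(-\gamma \lVert u - v \rVert^2) \\
\text{sigmoid:} \quad & K(u, v) = \tanh(\gamma\, u^\top v + c_0)
\end{aligned}$$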
classifier1 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'linear')
predSvm1 <- predict(classifier1, newdata = test.data)
result <- table(test.data$is_claim, predSvm1)
cmSvm1 <- confusionMatrix(data = predSvm1, reference = test.data$is_claim, positive = "1")
cmSvm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5877 298
## 1 5091 451
##
## Accuracy : 0.5401
## 95% CI : (0.531, 0.5491)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0347
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60214
## Specificity : 0.53583
## Pos Pred Value : 0.08138
## Neg Pred Value : 0.95174
## Prevalence : 0.06392
## Detection Rate : 0.03849
## Detection Prevalence : 0.47299
## Balanced Accuracy : 0.56898
##
## 'Positive' Class : 1
##
We will now apply a few other popular kernels (radial, polynomial, and sigmoid) and compare the results against the linear kernel.
classifier2 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'radial')
predSvm2 <- predict(classifier2, newdata = test.data)
cmSvm2 <- confusionMatrix(data = predSvm2, reference = test.data$is_claim, positive = "1")
cmSvm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4815 190
## 1 6153 559
##
## Accuracy : 0.4586
## 95% CI : (0.4496, 0.4677)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0394
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.74633
## Specificity : 0.43900
## Pos Pred Value : 0.08328
## Neg Pred Value : 0.96204
## Prevalence : 0.06392
## Detection Rate : 0.04771
## Detection Prevalence : 0.57284
## Balanced Accuracy : 0.59267
##
## 'Positive' Class : 1
##
classifier3 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'polynomial')
predSvm3 <- predict(classifier3, newdata = test.data)
cmSvm3 <- confusionMatrix(data = predSvm3, reference = test.data$is_claim, positive = "1")
cmSvm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3943 147
## 1 7025 602
##
## Accuracy : 0.3879
## 95% CI : (0.3791, 0.3968)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0309
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80374
## Specificity : 0.35950
## Pos Pred Value : 0.07893
## Neg Pred Value : 0.96406
## Prevalence : 0.06392
## Detection Rate : 0.05138
## Detection Prevalence : 0.65093
## Balanced Accuracy : 0.58162
##
## 'Positive' Class : 1
##
classifier4 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'sigmoid')
predSvm4 <- predict(classifier4, newdata = test.data)
cmSvm4 <- confusionMatrix(data = predSvm4, reference = test.data$is_claim, positive = "1")
cmSvm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5699 379
## 1 5269 370
##
## Accuracy : 0.518
## 95% CI : (0.5089, 0.5271)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0034
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.49399
## Specificity : 0.51960
## Pos Pred Value : 0.06561
## Neg Pred Value : 0.93764
## Prevalence : 0.06392
## Detection Rate : 0.03158
## Detection Prevalence : 0.48127
## Balanced Accuracy : 0.50680
##
## 'Positive' Class : 1
##
From our analysis, of the four SVM kernels we tried, the linear kernel gives the best accuracy (accuracy: 54.01%, sensitivity: 60.21%, specificity: 53.58%). When we compare these results against the random forest, however, the random forest performed better (accuracy: 57.41%, sensitivity: 57.06%, specificity: 62.62%).
This also indicates that none of the kernel functions could produce a clear margin of separation between the two classes that would yield higher accuracy.
So some degree of linear separation exists between the classes, but it alone is not enough for good accuracy. The large number of categorical variables appears to be what makes the random forest better suited to this classification problem.
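To make this comparison easier to scan, we can collect the headline metrics from the confusion-matrix objects computed above into a single table (a small helper sketch; the element names follow caret's confusionMatrix output):

# Confusion-matrix objects computed earlier in this analysis
cms <- list("Decision Tree"    = dtSub1CM,
            "Random Forest"    = forest.cm_train,  # NB: positive class is "0" here
            "SVM (linear)"     = cmSvm1,
            "SVM (radial)"     = cmSvm2,
            "SVM (polynomial)" = cmSvm3,
            "SVM (sigmoid)"    = cmSvm4)

model_results <- tibble(
  model       = names(cms),
  accuracy    = sapply(cms, function(cm) cm$overall["Accuracy"]),
  sensitivity = sapply(cms, function(cm) cm$byClass["Sensitivity"]),
  specificity = sapply(cms, function(cm) cm$byClass["Specificity"])
)
model_results %>% arrange(desc(accuracy))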
**A couple of academic research articles comparing decision trees vs. SVMs**
The article below uses support vector machines and decision trees to predict the prognosis of metformin poisoning in the United States, based on an analysis of the National Poison Data System. According to the study, the SVM model's accuracy in predicting the prognosis of metformin poisoning was higher than the DT model's.
https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0
Another article presents a comprehensive comparison of random forests and support vector machines for microarray-based cancer classification.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-319
The authors found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines.
So we can conclude that the accuracy of different classification algorithms varies with the dataset: its noise, outliers, the number and types of features (numerical vs. categorical), the volume of data, and so on. We can also see that certain classes of algorithms do a better job with certain types of data.
Which algorithm is recommended to get more accurate results?
SVM uses kernel functions to solve non-linear problems, whereas decision trees derive hyper-rectangles in the input space. Decision trees handle categorical data natively and deal with collinearity better than SVM.
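One way to see the burden that the many factor variables place on the SVM: svm() expands each factor into numeric dummy columns internally (via the model matrix), so a single feature like area_cluster with 22 levels contributes 21 extra input dimensions. A quick illustrative check:

# 22 columns: an intercept plus 21 dummy variables (one per non-reference level)
dim(model.matrix(~ area_cluster, data = trainData))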
Is it better for classification or regression scenarios?
Even though SVM can be used for both regression and classification problems, it is most widely used for classification.