Search for academic content (at least 3 articles) that compare the use of decision trees vs SVMs in your current area of expertise. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.
As part of this homework, I decided to use the [Kaggle dataset](https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification).
The dataset contains 58,592 observations on 44 variables. As part of this exercise, I will analyze the data and predict whether a policy results in an insurance claim. I will approach this classification problem using a Decision Tree, a Random Forest, and SVM, then evaluate and compare the performance of all three models.
Load the required libraries
library(stats)
library(corrplot)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(caret)
library(rpart.plot)
library(DataExplorer)
library(skimr)
library(performanceEstimation)
library(randomForest)
library(e1071)
# col_types string: c = character, n = number, f = factor
trainData <- read_csv("train.csv", col_types = "cnnnfnffffffffffffffffffffffffffffffffffffff")
dim(trainData)
## [1] 58592 44
Let’s analyze the distribution of the various features and their values.
# Snippet of the data
head(trainData)
## # A tibble: 6 x 44
## policy_id policy_tenure age_of_car age_of_policyholder area_cluster
## <chr> <dbl> <dbl> <dbl> <fct>
## 1 ID00001 0.516 0.05 0.644 C1
## 2 ID00002 0.673 0.02 0.375 C2
## 3 ID00003 0.841 0.02 0.385 C3
## 4 ID00004 0.900 0.11 0.433 C4
## 5 ID00005 0.596 0.11 0.635 C5
## 6 ID00006 1.02 0.07 0.519 C6
## # ... with 39 more variables: population_density <dbl>, make <fct>,
## # segment <fct>, model <fct>, fuel_type <fct>, max_torque <fct>,
## # max_power <fct>, engine_type <fct>, airbags <fct>, is_esc <fct>,
## # is_adjustable_steering <fct>, is_tpms <fct>, is_parking_sensors <fct>,
## # is_parking_camera <fct>, rear_brakes_type <fct>, displacement <fct>,
## # cylinder <fct>, transmission_type <fct>, gear_box <fct>,
## # steering_type <fct>, turning_radius <fct>, length <fct>, width <fct>, ...
# glimpse and summary of the data
glimpse(trainData)
## Rows: 58,592
## Columns: 44
## $ policy_id <chr> "ID00001", "ID00002", "ID00003", "ID0~
## $ policy_tenure <dbl> 0.51587359, 0.67261851, 0.84111026, 0~
## $ age_of_car <dbl> 0.05, 0.02, 0.02, 0.11, 0.11, 0.07, 0~
## $ age_of_policyholder <dbl> 0.6442308, 0.3750000, 0.3846154, 0.43~
## $ area_cluster <fct> C1, C2, C3, C4, C5, C6, C7, C8, C7, C~
## $ population_density <dbl> 4990, 27003, 4076, 21622, 34738, 1305~
## $ make <fct> 1, 1, 1, 1, 2, 3, 4, 1, 3, 1, 1, 1, 1~
## $ segment <fct> A, A, A, C1, A, C2, B2, B2, C2, B2, A~
## $ model <fct> M1, M1, M1, M2, M3, M4, M5, M6, M4, M~
## $ fuel_type <fct> CNG, CNG, CNG, Petrol, Petrol, Diesel~
## $ max_torque <fct> 60Nm@3500rpm, 60Nm@3500rpm, 60Nm@3500~
## $ max_power <fct> 40.36bhp@6000rpm, 40.36bhp@6000rpm, 4~
## $ engine_type <fct> F8D Petrol Engine, F8D Petrol Engine,~
## $ airbags <fct> 2, 2, 2, 2, 2, 6, 2, 2, 6, 6, 2, 2, 2~
## $ is_esc <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_adjustable_steering <fct> No, No, No, Yes, No, Yes, Yes, Yes, Y~
## $ is_tpms <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_parking_sensors <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes~
## $ is_parking_camera <fct> No, No, No, Yes, Yes, Yes, No, No, Ye~
## $ rear_brakes_type <fct> Drum, Drum, Drum, Drum, Drum, Disc, D~
## $ displacement <fct> 796, 796, 796, 1197, 999, 1493, 1497,~
## $ cylinder <fct> 3, 3, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4~
## $ transmission_type <fct> Manual, Manual, Manual, Automatic, Au~
## $ gear_box <fct> 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5~
## $ steering_type <fct> Power, Power, Power, Electric, Electr~
## $ turning_radius <fct> 4.6, 4.6, 4.6, 4.8, 5, 5.2, 5, 4.8, 5~
## $ length <fct> 3445, 3445, 3445, 3995, 3731, 4300, 3~
## $ width <fct> 1515, 1515, 1515, 1735, 1579, 1790, 1~
## $ height <fct> 1475, 1475, 1475, 1515, 1490, 1635, 1~
## $ gross_weight <fct> 1185, 1185, 1185, 1335, 1155, 1720, 1~
## $ is_front_fog_lights <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_rear_window_wiper <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_washer <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_defogger <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_brake_assist <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_power_door_locks <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_central_locking <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_power_steering <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ is_driver_seat_height_adjustable <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_day_night_rear_view_mirror <fct> No, No, No, Yes, Yes, No, No, Yes, No~
## $ is_ecw <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_speed_alert <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ ncap_rating <fct> 0, 0, 0, 2, 2, 3, 5, 2, 3, 0, 0, 2, 2~
## $ is_claim <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
summary(trainData)
## policy_id policy_tenure age_of_car age_of_policyholder
## Length:58592 Min. :0.002735 Min. :0.00000 Min. :0.2885
## Class :character 1st Qu.:0.210250 1st Qu.:0.02000 1st Qu.:0.3654
## Mode :character Median :0.573792 Median :0.06000 Median :0.4519
## Mean :0.611246 Mean :0.06942 Mean :0.4694
## 3rd Qu.:1.039104 3rd Qu.:0.11000 3rd Qu.:0.5481
## Max. :1.396641 Max. :1.00000 Max. :1.0000
##
## area_cluster population_density make segment model
## C8 :13654 Min. : 290 1:38126 A :17321 M1 :14948
## C2 : 7342 1st Qu.: 6112 2: 2373 C1 : 3557 M4 :14018
## C5 : 6979 Median : 8794 3:14018 C2 :14018 M6 :13776
## C3 : 6101 Mean :18827 4: 1961 B2 :18314 M8 : 4173
## C14 : 3660 3rd Qu.:27003 5: 2114 B1 : 4173 M7 : 2940
## C13 : 3423 Max. :73430 Utility: 1209 M3 : 2373
## (Other):17433 (Other): 6364
## fuel_type max_torque max_power
## CNG :20330 113Nm@4400rpm :17796 88.50bhp@6000rpm :17796
## Petrol:20532 60Nm@3500rpm :14948 40.36bhp@6000rpm :14948
## Diesel:17730 250Nm@2750rpm :14018 113.45bhp@4000rpm:14018
## 82.1Nm@3400rpm: 4173 55.92bhp@5300rpm : 4173
## 91Nm@4250rpm : 2373 67.06bhp@5500rpm : 2373
## 200Nm@1750rpm : 2114 97.89bhp@3600rpm : 2114
## (Other) : 3170 (Other) : 3170
## engine_type airbags is_esc is_adjustable_steering
## F8D Petrol Engine :14948 2:40425 No :40191 No :23066
## 1.5 L U2 CRDi :14018 6:16958 Yes:18401 Yes:35526
## K Series Dual jet :13776 1: 1209
## K10C : 4173
## 1.2 L K Series Engine: 2940
## 1.0 SCe : 2373
## (Other) : 6364
## is_tpms is_parking_sensors is_parking_camera rear_brakes_type
## No :44574 Yes:56219 No :35704 Drum:44574
## Yes:14018 No : 2373 Yes:22888 Disc:14018
##
##
##
##
##
## displacement cylinder transmission_type gear_box steering_type
## 1197 :17796 3:21857 Manual :38181 5:44211 Power :33502
## 796 :14948 4:36735 Automatic:20411 6:14381 Electric:23881
## 1493 :14018 Manual : 1209
## 998 : 4173
## 999 : 2373
## 1498 : 2114
## (Other): 3170
## turning_radius length width height
## 4.6 :14948 3445 :14948 1515 :14948 1475 :14948
## 4.8 :14856 4300 :14018 1735 :14856 1635 :14018
## 5.2 :14018 3845 :13776 1790 :14018 1530 :13776
## 4.7 : 4173 3990 : 4538 1620 : 4173 1675 : 4173
## 5 : 3971 3655 : 4173 1745 : 2940 1500 : 2940
## 4.85 : 2940 3995 : 3194 1579 : 2373 1490 : 2373
## (Other): 3686 (Other): 3945 (Other): 5284 (Other): 6364
## gross_weight is_front_fog_lights is_rear_window_wiper is_rear_window_washer
## 1185 :14948 No :24664 No :41634 No :41634
## 1335 :14856 Yes:33928 Yes:16958 Yes:16958
## 1720 :14018
## 1340 : 4173
## 1410 : 2940
## 1155 : 2373
## (Other): 5284
## is_rear_window_defogger is_brake_assist is_power_door_locks is_central_locking
## No :38077 No :26415 No :16157 No :16157
## Yes:20515 Yes:32177 Yes:42435 Yes:42435
##
##
##
##
##
## is_power_steering is_driver_seat_height_adjustable
## Yes:57383 No :24301
## No : 1209 Yes:34291
##
##
##
##
##
## is_day_night_rear_view_mirror is_ecw is_speed_alert ncap_rating is_claim
## No :36309 No :16157 Yes:58229 0:19097 0:54844
## Yes:22283 Yes:42435 No : 363 2:21402 1: 3748
## 3:14018
## 5: 1961
## 4: 2114
##
##
plot_missing(trainData)
The missingness plot confirms that there is no missing data in this dataset.
Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(trainData)
| Name | trainData |
| Number of rows | 58592 |
| Number of columns | 44 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 39 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| policy_id | 0 | 1 | 7 | 7 | 0 | 58592 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| area_cluster | 0 | 1 | FALSE | 22 | C8: 13654, C2: 7342, C5: 6979, C3: 6101 |
| make | 0 | 1 | FALSE | 5 | 1: 38126, 3: 14018, 2: 2373, 5: 2114 |
| segment | 0 | 1 | FALSE | 6 | B2: 18314, A: 17321, C2: 14018, B1: 4173 |
| model | 0 | 1 | FALSE | 11 | M1: 14948, M4: 14018, M6: 13776, M8: 4173 |
| fuel_type | 0 | 1 | FALSE | 3 | Pet: 20532, CNG: 20330, Die: 17730 |
| max_torque | 0 | 1 | FALSE | 9 | 113: 17796, 60N: 14948, 250: 14018, 82.: 4173 |
| max_power | 0 | 1 | FALSE | 9 | 88.: 17796, 40.: 14948, 113: 14018, 55.: 4173 |
| engine_type | 0 | 1 | FALSE | 11 | F8D: 14948, 1.5: 14018, K S: 13776, K10: 4173 |
| airbags | 0 | 1 | FALSE | 3 | 2: 40425, 6: 16958, 1: 1209 |
| is_esc | 0 | 1 | FALSE | 2 | No: 40191, Yes: 18401 |
| is_adjustable_steering | 0 | 1 | FALSE | 2 | Yes: 35526, No: 23066 |
| is_tpms | 0 | 1 | FALSE | 2 | No: 44574, Yes: 14018 |
| is_parking_sensors | 0 | 1 | FALSE | 2 | Yes: 56219, No: 2373 |
| is_parking_camera | 0 | 1 | FALSE | 2 | No: 35704, Yes: 22888 |
| rear_brakes_type | 0 | 1 | FALSE | 2 | Dru: 44574, Dis: 14018 |
| displacement | 0 | 1 | FALSE | 9 | 119: 17796, 796: 14948, 149: 14018, 998: 4173 |
| cylinder | 0 | 1 | FALSE | 2 | 4: 36735, 3: 21857 |
| transmission_type | 0 | 1 | FALSE | 2 | Man: 38181, Aut: 20411 |
| gear_box | 0 | 1 | FALSE | 2 | 5: 44211, 6: 14381 |
| steering_type | 0 | 1 | FALSE | 3 | Pow: 33502, Ele: 23881, Man: 1209 |
| turning_radius | 0 | 1 | FALSE | 9 | 4.6: 14948, 4.8: 14856, 5.2: 14018, 4.7: 4173 |
| length | 0 | 1 | FALSE | 9 | 344: 14948, 430: 14018, 384: 13776, 399: 4538 |
| width | 0 | 1 | FALSE | 10 | 151: 14948, 173: 14856, 179: 14018, 162: 4173 |
| height | 0 | 1 | FALSE | 11 | 147: 14948, 163: 14018, 153: 13776, 167: 4173 |
| gross_weight | 0 | 1 | FALSE | 10 | 118: 14948, 133: 14856, 172: 14018, 134: 4173 |
| is_front_fog_lights | 0 | 1 | FALSE | 2 | Yes: 33928, No: 24664 |
| is_rear_window_wiper | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_washer | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_defogger | 0 | 1 | FALSE | 2 | No: 38077, Yes: 20515 |
| is_brake_assist | 0 | 1 | FALSE | 2 | Yes: 32177, No: 26415 |
| is_power_door_locks | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_central_locking | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_power_steering | 0 | 1 | FALSE | 2 | Yes: 57383, No: 1209 |
| is_driver_seat_height_adjustable | 0 | 1 | FALSE | 2 | Yes: 34291, No: 24301 |
| is_day_night_rear_view_mirror | 0 | 1 | FALSE | 2 | No: 36309, Yes: 22283 |
| is_ecw | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_speed_alert | 0 | 1 | FALSE | 2 | Yes: 58229, No: 363 |
| ncap_rating | 0 | 1 | FALSE | 5 | 2: 21402, 0: 19097, 3: 14018, 4: 2114 |
| is_claim | 0 | 1 | FALSE | 2 | 0: 54844, 1: 3748 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| policy_tenure | 0 | 1 | 0.61 | 0.41 | 0.00 | 0.21 | 0.57 | 1.04 | 1.4 | ▇▅▃▆▃ |
| age_of_car | 0 | 1 | 0.07 | 0.06 | 0.00 | 0.02 | 0.06 | 0.11 | 1.0 | ▇▁▁▁▁ |
| age_of_policyholder | 0 | 1 | 0.47 | 0.12 | 0.29 | 0.37 | 0.45 | 0.55 | 1.0 | ▇▇▃▁▁ |
| population_density | 0 | 1 | 18826.86 | 17660.17 | 290.00 | 6112.00 | 8794.00 | 27003.00 | 73430.0 | ▇▃▂▁▁ |
Next, we compare each continuous feature against the target using boxplots:
trainData %>%
  ggplot(aes(x = policy_tenure, fill = is_claim)) +
  geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_car, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_policyholder, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = population_density, fill = is_claim)) +
geom_boxplot()
Distribution of the data across the categorical variables
long_df1 <- trainData %>% select(c('is_speed_alert','ncap_rating','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df1, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df2 <- trainData %>% select(c('segment','area_cluster','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df2, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) + ggtitle('Comparing Categorical Features and Target')
long_df3 <- trainData %>% select(c('model','is_parking_camera','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df3, aes(x=value, fill=is_claim)) +
  geom_bar() +
  facet_wrap(~kpi, scales="free_x") +
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
  ggtitle('Comparing Categorical Features and Target')
We are considering only the major categorical variables for this analysis.
Based on this exploration, only a subset of the features appears relevant, so we keep just those:
trainData2 <- trainData %>%
  select(policy_tenure, age_of_car, age_of_policyholder, area_cluster, population_density,
         make, segment, model, max_torque, is_parking_sensors, is_parking_camera,
         is_brake_assist, is_power_door_locks, is_central_locking, is_speed_alert,
         ncap_rating, is_claim)
Target Variable Distribution
Let’s visualize our target distribution.
# Bar plot for target (insurance claim)
ggplot(trainData, aes(x=is_claim, fill=is_claim)) +
geom_bar() +
xlab("Insurance Claim") +
ylab("Count") +
ggtitle("Analysis of Insurance Claim") +
scale_fill_discrete(name = "Insurance Claim", labels = c("Absence", "Presence"))
round(prop.table(table(select(trainData, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
Our analysis shows a clear class imbalance in this dataset. Imbalanced data can be quite problematic when classifying the minority class, so I will oversample the training data before modeling to give the models a better chance at identifying the minority class.
One way to address this imbalance is the Synthetic Minority Oversampling Technique, often abbreviated SMOTE. This technique creates a new dataset by oversampling observations from the minority class, which produces a dataset with more balanced classes.
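Concretely, for each minority-class observation $x_i$, SMOTE picks one of its $k$ nearest minority-class neighbours $x_{nn}$ and interpolates a synthetic point between them (this is the standard formulation of the technique, shown here for reference):

$$x_{new} = x_i + \lambda\,(x_{nn} - x_i), \qquad \lambda \sim \mathrm{Uniform}(0, 1).$$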
Our first step is to separate the data into training and test datasets, so that we can later measure accuracy on the holdout test set. We perform an 80/20 train/test split.
# Splitting the data 80/20
set.seed(1234)
training.samples <- trainData2$is_claim %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- trainData2[training.samples, ]
test.data  <- trainData2[-training.samples, ]  # subset trainData2 (not trainData) so train and test share the same columns
round(prop.table(table(select(train.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
round(prop.table(table(select(test.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
We can see that the target variable is distributed identically in the training and test sets, in line with the original data.
Since the target variable is imbalanced, we will now address this by oversampling the minority class.
set.seed(12345)
# perc.over = 1: generate one synthetic example per original minority-class case
train.balanced <- smote(is_claim ~ ., data = train.data, perc.over = 1)
train.balanced %>% ggplot(aes(is_claim)) +
geom_bar(fill = "#04354F") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
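As a quick numeric check, mirroring the proportion tables above, we can confirm that the two classes are now balanced in the SMOTE'd training set:
round(prop.table(table(train.balanced$is_claim)), 2)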
We start by training a decision tree on the balanced data using the selected features:
DT_modelSub1 <- rpart(is_claim ~ age_of_car + age_of_policyholder + population_density +
                        policy_tenure + is_parking_sensors + is_power_door_locks +
                        ncap_rating + is_brake_assist,
                      data = train.balanced, method = 'class')
rpart.plot(DT_modelSub1)
predSub1 <-predict(DT_modelSub1, test.data, type="class")
dtSub1CM <- confusionMatrix(data = predSub1, reference = test.data$is_claim, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5565 226
## 1 5403 523
##
## Accuracy : 0.5196
## 95% CI : (0.5105, 0.5287)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0487
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.69826
## Specificity : 0.50739
## Pos Pred Value : 0.08826
## Neg Pred Value : 0.96097
## Prevalence : 0.06392
## Detection Rate : 0.04464
## Detection Prevalence : 0.50576
## Balanced Accuracy : 0.60282
##
## 'Positive' Class : 1
##
Measure the performance
We can see that performance improves when we consider only the few relevant features, and the decision tree model now includes one additional parameter, is_brake_assist.
The accuracy is now 51.96%, the sensitivity 69.83%, and the specificity 50.74%.
Next, we fit a random forest on the same features:
set.seed(123)
fit.forest <- randomForest(is_claim ~ age_of_car + age_of_policyholder + population_density +
                             policy_tenure + is_parking_sensors + is_power_door_locks +
                             ncap_rating + is_brake_assist,
                           data = train.balanced, importance = TRUE, ntree = 200)
# display model details
fit.forest
##
## Call:
## randomForest(formula = is_claim ~ age_of_car + age_of_policyholder + population_density + policy_tenure + is_parking_sensors + is_power_door_locks + ncap_rating + is_brake_assist, data = train.balanced, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.9%
## Confusion matrix:
## 0 1 class.error
## 0 3468 2530 0.4218073
## 1 1297 4701 0.2162387
rf.pred <- predict(fit.forest, newdata=test.data, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, test.data$is_claim))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6258 280
## 1 4710 469
##
## Accuracy : 0.5741
## 95% CI : (0.5651, 0.5831)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0524
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.57057
## Specificity : 0.62617
## Pos Pred Value : 0.95717
## Neg Pred Value : 0.09056
## Prevalence : 0.93608
## Detection Rate : 0.53410
## Detection Prevalence : 0.55799
## Balanced Accuracy : 0.59837
##
## 'Positive' Class : 0
##
The random forest model's accuracy is 57.41%, with sensitivity 57.06% and specificity 62.62%, which is better than the decision tree. Note that confusionMatrix was called here without positive = "1", so the positive class defaulted to "0"; the sensitivity and specificity above are therefore reported with respect to the majority (no-claim) class, unlike the decision tree's figures.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly separates the data points.
Many possible hyperplanes could separate the two classes. Our objective is to find the plane with the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
Support vectors are the data points closest to the hyperplane; they determine its position and orientation. Using these support vectors, we maximize the margin of the classifier; deleting a support vector would change the position of the hyperplane.
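For reference, the standard hard-margin formulation of this idea (general SVM theory, not specific to our data): for training points $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, the separating hyperplane is $w \cdot x + b = 0$ and its margin is $2 / \lVert w \rVert$, so maximizing the margin is equivalent to solving

$$\min_{w,\, b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i.$$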
SVM algorithms use a family of mathematical functions known as kernels. A kernel takes the data as input and transforms it into the required form; different SVM variants use different kernel functions.
In essence, the kernel determines the style of SVM used to classify the data. Here we will apply different kernel functions and then verify which one gives the most accurate result.
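For reference, the four kernels we apply below are, in e1071's parameterization (with hyperparameters $\gamma$, $c_0$, and degree $d$):

$$\begin{aligned}
\text{linear:} \quad & K(u, v) = u^\top v \\
\text{polynomial:} \quad & K(u, v) = (\gamma\, u^\top v + c_0)^d \\
\text{radial:} \quad & K(u, v) = \exp(-\gamma \lVert u - v \rVert^2) \\
\text{sigmoid:} \quad & K(u, v) = \tanh(\gamma\, u^\top v + c_0)
\end{aligned}$$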
classifier1 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'linear')
predSvm1 <- predict(classifier1, newdata = test.data)
result <- table(test.data$is_claim, predSvm1)
cmSvm1 <- confusionMatrix(data = predSvm1, reference = test.data$is_claim, positive = "1")
cmSvm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5877 298
## 1 5091 451
##
## Accuracy : 0.5401
## 95% CI : (0.531, 0.5491)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0347
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60214
## Specificity : 0.53583
## Pos Pred Value : 0.08138
## Neg Pred Value : 0.95174
## Prevalence : 0.06392
## Detection Rate : 0.03849
## Detection Prevalence : 0.47299
## Balanced Accuracy : 0.56898
##
## 'Positive' Class : 1
##
We will now apply a few other popular kernels (radial, polynomial, and sigmoid) and compare the results against the linear kernel.
classifier2 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'radial')
predSvm2 <- predict(classifier2, newdata = test.data)
cmSvm2 <- confusionMatrix(data = predSvm2, reference = test.data$is_claim, positive = "1")
cmSvm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4815 190
## 1 6153 559
##
## Accuracy : 0.4586
## 95% CI : (0.4496, 0.4677)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0394
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.74633
## Specificity : 0.43900
## Pos Pred Value : 0.08328
## Neg Pred Value : 0.96204
## Prevalence : 0.06392
## Detection Rate : 0.04771
## Detection Prevalence : 0.57284
## Balanced Accuracy : 0.59267
##
## 'Positive' Class : 1
##
classifier3 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'polynomial')
predSvm3 <- predict(classifier3, newdata = test.data)
cmSvm3 <- confusionMatrix(data = predSvm3, reference = test.data$is_claim, positive = "1")
cmSvm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3943 147
## 1 7025 602
##
## Accuracy : 0.3879
## 95% CI : (0.3791, 0.3968)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0309
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.80374
## Specificity : 0.35950
## Pos Pred Value : 0.07893
## Neg Pred Value : 0.96406
## Prevalence : 0.06392
## Detection Rate : 0.05138
## Detection Prevalence : 0.65093
## Balanced Accuracy : 0.58162
##
## 'Positive' Class : 1
##
classifier4 <- svm(formula = is_claim ~ age_of_car + age_of_policyholder + population_density +
                     policy_tenure + is_parking_sensors + is_power_door_locks +
                     ncap_rating + is_brake_assist,
                   data = train.balanced,
                   type = 'C-classification',
                   kernel = 'sigmoid')
predSvm4 <- predict(classifier4, newdata = test.data)
cmSvm4 <- confusionMatrix(data = predSvm4, reference = test.data$is_claim, positive = "1")
cmSvm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5699 379
## 1 5269 370
##
## Accuracy : 0.518
## 95% CI : (0.5089, 0.5271)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0034
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.49399
## Specificity : 0.51960
## Pos Pred Value : 0.06561
## Neg Pred Value : 0.93764
## Prevalence : 0.06392
## Detection Rate : 0.03158
## Detection Prevalence : 0.48127
## Balanced Accuracy : 0.50680
##
## 'Positive' Class : 1
##
From our analysis, of the four SVM kernels we tried, the linear kernel gives the best accuracy (accuracy: 54.01%, sensitivity: 60.21%, specificity: 53.58%). When we compare these results against the random forest, however, the random forest performed better (accuracy: 57.41%, sensitivity: 57.06%, specificity: 62.62%).
This also indicates that none of the kernel functions could produce a clear margin of separation between the two classes that would yield higher accuracy.
So some degree of linear separation exists between the classes, but it alone is not enough for good accuracy. The large number of categorical variables appears to be what makes the random forest better suited to this classification problem.
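To make this comparison easier to scan, we can collect the headline metrics from the confusion-matrix objects computed above into a single table (a small helper sketch; the element names follow caret's confusionMatrix output):

# Confusion-matrix objects computed earlier in this analysis
cms <- list("Decision Tree"    = dtSub1CM,
            "Random Forest"    = forest.cm_train,  # NB: positive class is "0" here
            "SVM (linear)"     = cmSvm1,
            "SVM (radial)"     = cmSvm2,
            "SVM (polynomial)" = cmSvm3,
            "SVM (sigmoid)"    = cmSvm4)

model_results <- tibble(
  model       = names(cms),
  accuracy    = sapply(cms, function(cm) cm$overall["Accuracy"]),
  sensitivity = sapply(cms, function(cm) cm$byClass["Sensitivity"]),
  specificity = sapply(cms, function(cm) cm$byClass["Specificity"])
)
model_results %>% arrange(desc(accuracy))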
**A couple of academic research articles comparing decision trees vs. SVMs**
The article below uses support vector machines and decision trees to predict the prognosis of metformin poisoning in the United States, based on an analysis of the National Poison Data System. According to the study, the SVM model's accuracy in predicting the prognosis of metformin poisoning was higher than the DT model's.
https://bmcpharmacoltoxicol.biomedcentral.com/articles/10.1186/s40360-022-00588-0
Another article presents a comprehensive comparison of random forests and support vector machines for microarray-based cancer classification.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-319
The authors found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines.
So we can conclude that the accuracy of different classification algorithms varies with the dataset: its noise, outliers, the number and types of features (numerical vs. categorical), the volume of data, and so on. We can also see that certain classes of algorithms do a better job with certain types of data.
Which algorithm is recommended to get more accurate results?
SVM uses kernel functions to solve non-linear problems, whereas decision trees derive hyper-rectangles in the input space. Decision trees handle categorical data natively and deal with collinearity better than SVM.
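One way to see the burden that the many factor variables place on the SVM: svm() expands each factor into numeric dummy columns internally (via the model matrix), so a single feature like area_cluster with 22 levels contributes 21 extra input dimensions. A quick illustrative check:

# 22 columns: an intercept plus 21 dummy variables (one per non-reference level)
dim(model.matrix(~ area_cluster, data = trainData))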
Is it better for classification or regression scenarios?
Even though SVM can be used for both regression and classification problems, it is most widely used for classification.