Introduction

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is the sum of money that the customer pays regularly to the insurance company in return for this guarantee.

Vehicle insurance works much like medical insurance: every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays the customer a compensation (called the ‘sum assured’).

For this project, we use a dataset found on Kaggle: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction.
The data is about an insurance company that has provided health insurance to its customers in the past year and is now interested in offering vehicle insurance to those policy holders.

Metadata

Variable Definition
id Unique ID for the customer
Gender Gender of the customer
Age Age of the customer
Driving_License 0 : Customer does not have DL, 1 : Customer already has DL
Region_Code Unique code for the region of the customer
Previously_Insured 1 : Customer already has Vehicle Insurance, 0 : Customer doesn’t have Vehicle Insurance
Vehicle_Age Age of the Vehicle
Vehicle_Damage Yes : Customer got his/her vehicle damaged in the past, No : Customer didn’t get his/her vehicle damaged in the past.
Annual_Premium The amount customer needs to pay as premium in the year
Policy_Sales_Channel Anonymized code for the channel of reaching out to the customer, i.e., different agents, over mail, over phone, in person, etc.
Vintage Number of Days, Customer has been associated with the company
Response 1 : Customer is interested, 0 : Customer is not interested

Objective:

To predict whether an existing policy holder would be interested in buying vehicle insurance as well. A model that predicts which customers are interested in vehicle insurance is extremely helpful for the company: it can plan its communication strategy to reach out to those customers and optimize its business model and revenue accordingly.

The aim of this project is to leverage machine learning algorithms such as Logistic Regression and Random Forest to build a predictive model using the statistically significant variables in the given dataset.

Model performance will be assessed using techniques such as the ROC curve (receiver operating characteristic), the AUC (area under the ROC curve), and the confusion matrix.

Implementation

ins_data = read.csv("vehicle_insurance.csv")

head(ins_data)
##   id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age
## 1  1   Male  44               1          28                  0   > 2 Years
## 2  2   Male  76               1           3                  0    1-2 Year
## 3  3   Male  47               1          28                  0   > 2 Years
## 4  4   Male  21               1          11                  1    < 1 Year
## 5  5 Female  29               1          41                  1    < 1 Year
## 6  6 Female  24               1          33                  0    < 1 Year
##   Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
## 1            Yes          40454                   26     217        1
## 2             No          33536                   26     183        0
## 3            Yes          38294                   26      27        1
## 4             No          28619                  152     203        0
## 5             No          27496                  152      39        0
## 6            Yes           2630                  160     176        0
str(ins_data)
## 'data.frame':    381109 obs. of  12 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender              : chr  "Male" "Male" "Male" "Male" ...
##  $ Age                 : int  44 76 47 21 29 24 23 56 24 32 ...
##  $ Driving_License     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Region_Code         : num  28 3 28 11 41 33 11 28 3 6 ...
##  $ Previously_Insured  : int  0 0 0 1 1 0 0 0 1 1 ...
##  $ Vehicle_Age         : chr  "> 2 Years" "1-2 Year" "> 2 Years" "< 1 Year" ...
##  $ Vehicle_Damage      : chr  "Yes" "No" "Yes" "No" ...
##  $ Annual_Premium      : num  40454 33536 38294 28619 27496 ...
##  $ Policy_Sales_Channel: num  26 26 26 152 152 160 152 26 152 152 ...
##  $ Vintage             : int  217 183 27 203 39 176 249 72 28 80 ...
##  $ Response            : int  1 0 1 0 0 0 0 1 0 0 ...
summary(ins_data)
##        id            Gender               Age        Driving_License 
##  Min.   :     1   Length:381109      Min.   :20.00   Min.   :0.0000  
##  1st Qu.: 95278   Class :character   1st Qu.:25.00   1st Qu.:1.0000  
##  Median :190555   Mode  :character   Median :36.00   Median :1.0000  
##  Mean   :190555                      Mean   :38.82   Mean   :0.9979  
##  3rd Qu.:285832                      3rd Qu.:49.00   3rd Qu.:1.0000  
##  Max.   :381109                      Max.   :85.00   Max.   :1.0000  
##   Region_Code    Previously_Insured Vehicle_Age        Vehicle_Damage    
##  Min.   : 0.00   Min.   :0.0000     Length:381109      Length:381109     
##  1st Qu.:15.00   1st Qu.:0.0000     Class :character   Class :character  
##  Median :28.00   Median :0.0000     Mode  :character   Mode  :character  
##  Mean   :26.39   Mean   :0.4582                                          
##  3rd Qu.:35.00   3rd Qu.:1.0000                                          
##  Max.   :52.00   Max.   :1.0000                                          
##  Annual_Premium   Policy_Sales_Channel    Vintage         Response     
##  Min.   :  2630   Min.   :  1          Min.   : 10.0   Min.   :0.0000  
##  1st Qu.: 24405   1st Qu.: 29          1st Qu.: 82.0   1st Qu.:0.0000  
##  Median : 31669   Median :133          Median :154.0   Median :0.0000  
##  Mean   : 30564   Mean   :112          Mean   :154.3   Mean   :0.1226  
##  3rd Qu.: 39400   3rd Qu.:152          3rd Qu.:227.0   3rd Qu.:0.0000  
##  Max.   :540165   Max.   :163          Max.   :299.0   Max.   :1.0000
sum(is.na(ins_data$Gender))
## [1] 0
sum(is.na(ins_data$Vehicle_Damage))
## [1] 0
sum(is.na(ins_data$Vehicle_Age))
## [1] 0
nrow(ins_data)
## [1] 381109
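A compact way to run the same missing-value check across every column at once (a quick sketch):

# Sketch: count missing values in every column in one call
colSums(is.na(ins_data))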
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)

Data Manipulation

Count_Sales_Channel <- ins_data %>% group_by(Policy_Sales_Channel) %>% summarise(count = n()) %>% arrange(count)
Count_Sales_Channel
## # A tibble: 155 x 2
##    Policy_Sales_Channel count
##                   <dbl> <int>
##  1                   41     1
##  2                   43     1
##  3                   84     1
##  4                  123     1
##  5                  143     1
##  6                  144     1
##  7                  149     1
##  8                   50     2
##  9                   74     2
## 10                   75     2
## # ... with 145 more rows
mean_AP <- ins_data %>% group_by(Vehicle_Damage) %>% summarise(mean = mean(Annual_Premium))
mean_AP
## # A tibble: 2 x 2
##   Vehicle_Damage   mean
## * <chr>           <dbl>
## 1 No             30402.
## 2 Yes            30724.

Data Visualizations

library(ggthemes)

Age_Premium_Plot <- ggplot(data = ins_data) + geom_point(mapping = aes(x = Age, y = Annual_Premium)) + labs(x="Age",y="Annual Premium") + theme_wsj() + ggtitle("Age vs Annual Premium")
Age_Premium_Plot

Vehicle damage based on Age

ggplot(ins_data, aes(Age,fill=Vehicle_Damage))+ geom_bar(stat="count",position='dodge', width = 0.5)+ ggtitle("Age vs Vehicle Damage")

  • The proportion of damaged vehicles is lower than that of non-damaged vehicles in the 20-30 age group, whereas for the 45-80 age group the proportion of damaged vehicles is higher than that of non-damaged ones.
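To put a number on this, a short dplyr sketch (the age bands below are illustrative choices) computes the share of damaged vehicles per age band:

# Sketch: proportion of customers reporting past vehicle damage, by age band
ins_data %>%
  mutate(Age_Band = cut(Age, breaks = c(19, 30, 45, 85),
                        labels = c("20-30", "31-45", "46-85"))) %>%
  group_by(Age_Band) %>%
  summarise(Damage_Share = mean(Vehicle_Damage == "Yes"), Count = n())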

Vehicle damage based on Gender

ggplot(ins_data, aes(Gender,fill=Vehicle_Damage))+ geom_bar(stat="count",position='dodge', width = 0.5) + labs(title="Gender vs Vehicle Damage") + theme_bw() + scale_fill_brewer()

  • The proportion of women whose vehicles have been damaged is lower than the corresponding proportion for men.

Response count vs Vehicle Damage

ggplot(ins_data, mapping = aes(x = Response, y = Vehicle_Damage))+ theme_stata() + ggtitle("Response count vs Vehicle Damage") + geom_col()

  • From the above graph we can observe that customers whose vehicles were damaged in the past are more likely to be interested in vehicle insurance.
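This can be cross-checked with a simple row-wise proportion table (a quick sketch):

# Sketch: response rate split by vehicle damage history
prop.table(table(ins_data$Vehicle_Damage, ins_data$Response), margin = 1)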

Response count vs Vehicle Age

ggplot(ins_data, aes(Response,fill=Vehicle_Age))+ geom_bar(stat="count",position='dodge', width = 0.5) +labs(title="Vehicle Age vs Response count") + theme_bw()+ scale_fill_brewer(palette="Set1")

  • From the above graph we can observe that the majority of owners of new vehicles (< 1 Year) are not interested in vehicle insurance.
  • A significant number of owners of vehicles aged 1-2 Year are also not interested in vehicle insurance.
  • Among the customers who are interested (Response = 1), the majority own vehicles aged 1-2 Year.

Creating a data frame using the sqldf package

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
ins_data_copy <- sqldf("select * from ins_data")

#Control Structures
ins_data_copy$Gender <- ifelse(ins_data_copy$Gender == "Male", 1, 0)
ins_data_copy$Vehicle_Damage <- ifelse(ins_data_copy$Vehicle_Damage == "Yes", 1, 0)
ins_data_copy$Vehicle_Age <- ifelse(ins_data_copy$Vehicle_Age == "> 2 Years", 2, ifelse(ins_data_copy$Vehicle_Age == "1-2 Year", 1, 0))
str(ins_data_copy)
## 'data.frame':    381109 obs. of  12 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender              : num  1 1 1 1 0 0 1 0 0 0 ...
##  $ Age                 : int  44 76 47 21 29 24 23 56 24 32 ...
##  $ Driving_License     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Region_Code         : num  28 3 28 11 41 33 11 28 3 6 ...
##  $ Previously_Insured  : int  0 0 0 1 1 0 0 0 1 1 ...
##  $ Vehicle_Age         : num  2 1 2 0 0 0 0 1 0 0 ...
##  $ Vehicle_Damage      : num  1 0 1 0 0 1 1 1 0 0 ...
##  $ Annual_Premium      : num  40454 33536 38294 28619 27496 ...
##  $ Policy_Sales_Channel: num  26 26 26 152 152 160 152 26 152 152 ...
##  $ Vintage             : int  217 183 27 203 39 176 249 72 28 80 ...
##  $ Response            : int  1 0 1 0 0 0 0 1 0 0 ...

Correlation Matrix

cor(ins_data_copy)
##                                 id        Gender          Age Driving_License
## id                    1.0000000000  0.0009991434  0.001561212   -0.0005635246
## Gender                0.0009991434  1.0000000000  0.145545034   -0.0183738198
## Age                   0.0015612125  0.1455450343  1.000000000   -0.0797819773
## Driving_License      -0.0005635246 -0.0183738198 -0.079781977    1.0000000000
## Region_Code          -0.0005721027  0.0006041786  0.042573968   -0.0010808797
## Previously_Insured    0.0024570988 -0.0819321594 -0.254682388    0.0149694154
## Vehicle_Age           0.0008781725  0.1604271445  0.765790101   -0.0373057135
## Vehicle_Damage       -0.0015074826  0.0916059179  0.267534315   -0.0166220166
## Annual_Premium        0.0030268195  0.0036727445  0.067507002   -0.0119064804
## Policy_Sales_Channel -0.0028370302 -0.1111588141 -0.577825526    0.0437305491
## Vintage              -0.0006302473 -0.0025168966 -0.001264079   -0.0008480492
## Response             -0.0013681502  0.0524399138  0.111146895    0.0101551746
##                        Region_Code Previously_Insured   Vehicle_Age
## id                   -0.0005721027        0.002457099  0.0008781725
## Gender                0.0006041786       -0.081932159  0.1604271445
## Age                   0.0425739676       -0.254682388  0.7657901011
## Driving_License      -0.0010808797        0.014969415 -0.0373057135
## Region_Code           1.0000000000       -0.024658821  0.0437782890
## Previously_Insured   -0.0246588212        1.000000000 -0.3808731921
## Vehicle_Age           0.0437782890       -0.380873192  1.0000000000
## Vehicle_Damage        0.0282350121       -0.824142665  0.3968729148
## Annual_Premium       -0.0105875101        0.004268759  0.0415803272
## Policy_Sales_Channel -0.0424202424        0.219381069 -0.5506627965
## Vintage              -0.0027496311        0.002536787 -0.0018922251
## Response              0.0105698556       -0.341170463  0.2218739872
##                      Vehicle_Damage Annual_Premium Policy_Sales_Channel
## id                     -0.001507483   0.0030268195        -2.837030e-03
## Gender                  0.091605918   0.0036727445        -1.111588e-01
## Age                     0.267534315   0.0675070016        -5.778255e-01
## Driving_License        -0.016622017  -0.0119064804         4.373055e-02
## Region_Code             0.028235012  -0.0105875101        -4.242024e-02
## Previously_Insured     -0.824142665   0.0042687594         2.193811e-01
## Vehicle_Age             0.396872915   0.0415803272        -5.506628e-01
## Vehicle_Damage          1.000000000   0.0093492910        -2.243769e-01
## Annual_Premium          0.009349291   1.0000000000        -1.132468e-01
## Policy_Sales_Channel   -0.224376850  -0.1132468486         1.000000e+00
## Vintage                -0.002064373  -0.0006084172         1.849942e-06
## Response                0.354399544   0.0225746955        -1.390415e-01
##                            Vintage     Response
## id                   -6.302473e-04 -0.001368150
## Gender               -2.516897e-03  0.052439914
## Age                  -1.264079e-03  0.111146895
## Driving_License      -8.480492e-04  0.010155175
## Region_Code          -2.749631e-03  0.010569856
## Previously_Insured    2.536787e-03 -0.341170463
## Vehicle_Age          -1.892225e-03  0.221873987
## Vehicle_Damage       -2.064373e-03  0.354399544
## Annual_Premium       -6.084172e-04  0.022574696
## Policy_Sales_Channel  1.849942e-06 -0.139041501
## Vintage               1.000000e+00 -0.001050372
## Response             -1.050372e-03  1.000000000
  • From the above correlation matrix, we see that the Response variable correlates most strongly with Vehicle_Damage (0.35), Previously_Insured (-0.34), and Vehicle_Age (0.22).
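A quick way to see this ranking directly (a sketch using the numeric copy created above):

# Sketch: correlations of all variables with Response, ordered by absolute strength
resp_cor <- cor(ins_data_copy)[, "Response"]
resp_cor[order(abs(resp_cor), decreasing = TRUE)]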

Logistic Regression

Data division into Train and Test

set.seed(123)
train_idx <- sample(nrow(ins_data), .70*nrow(ins_data))

ins_train <- ins_data[train_idx,]
ins_test <- ins_data[-train_idx,]

Fitting logistic regression on all the variables

ins_logreg <- glm(Response ~.,family=binomial,data = ins_train)
summary(ins_logreg)
## 
## Call:
## glm(formula = Response ~ ., family = binomial, data = ins_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3528  -0.6234  -0.0451  -0.0293   3.9813  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.153e+00  2.125e-01 -19.544  < 2e-16 ***
## id                   -6.881e-08  5.877e-08  -1.171    0.242    
## GenderMale            8.999e-02  1.331e-02   6.763 1.35e-11 ***
## Age                  -2.555e-02  6.521e-04 -39.177  < 2e-16 ***
## Driving_License       1.321e+00  2.044e-01   6.462 1.03e-10 ***
## Region_Code           2.453e-04  5.212e-04   0.471    0.638    
## Previously_Insured   -4.004e+00  9.984e-02 -40.103  < 2e-16 ***
## Vehicle_Age> 2 Years  1.387e+00  3.213e-02  43.174  < 2e-16 ***
## Vehicle_Age1-2 Year   1.175e+00  2.247e-02  52.294  < 2e-16 ***
## Vehicle_DamageYes     2.016e+00  4.068e-02  49.568  < 2e-16 ***
## Annual_Premium        2.341e-06  3.542e-07   6.609 3.86e-11 ***
## Policy_Sales_Channel -2.415e-03  1.309e-04 -18.445  < 2e-16 ***
## Vintage              -7.412e-05  7.735e-05  -0.958    0.338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 198057  on 266775  degrees of freedom
## Residual deviance: 146383  on 266763  degrees of freedom
## AIC: 146409
## 
## Number of Fisher Scoring iterations: 9

The p-values for id, Region_Code, and Vintage are not statistically significant, so we remove these variables and refit the logistic regression.

ins_logreg_final <- glm(Response ~ Gender+Age+Driving_License+Previously_Insured
                  +Vehicle_Age+Vehicle_Damage+Annual_Premium+Policy_Sales_Channel,family=binomial,data = ins_train)
summary(ins_logreg_final)
## 
## Call:
## glm(formula = Response ~ Gender + Age + Driving_License + Previously_Insured + 
##     Vehicle_Age + Vehicle_Damage + Annual_Premium + Policy_Sales_Channel, 
##     family = binomial, data = ins_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3479  -0.6231  -0.0451  -0.0294   3.9827  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.172e+00  2.114e-01 -19.733  < 2e-16 ***
## GenderMale            8.992e-02  1.331e-02   6.758 1.40e-11 ***
## Age                  -2.554e-02  6.521e-04 -39.173  < 2e-16 ***
## Driving_License       1.322e+00  2.044e-01   6.466 1.00e-10 ***
## Previously_Insured   -4.004e+00  9.984e-02 -40.104  < 2e-16 ***
## Vehicle_Age> 2 Years  1.387e+00  3.213e-02  43.174  < 2e-16 ***
## Vehicle_Age1-2 Year   1.175e+00  2.247e-02  52.296  < 2e-16 ***
## Vehicle_DamageYes     2.016e+00  4.068e-02  49.572  < 2e-16 ***
## Annual_Premium        2.339e-06  3.542e-07   6.603 4.03e-11 ***
## Policy_Sales_Channel -2.416e-03  1.309e-04 -18.461  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 198057  on 266775  degrees of freedom
## Residual deviance: 146386  on 266766  degrees of freedom
## AIC: 146406
## 
## Number of Fisher Scoring iterations: 9

Histogram of predicted probability

testing_ins <- predict(ins_logreg_final, ins_test, type = "response")
hist(testing_ins)

  • The histogram shows that no predicted probability exceeds roughly 0.45, so the classification threshold must be set below 0.50. For this case study, we classify any individual with a predicted response probability greater than 0.25 as interested in buying vehicle insurance.
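To sanity-check this cut-off, a small sketch (the grid of thresholds is illustrative) compares accuracy and sensitivity at a few candidate values:

# Sketch: accuracy and sensitivity of the logistic model at candidate thresholds
sapply(c(0.15, 0.25, 0.35), function(th) {
  pred <- as.numeric(testing_ins > th)
  c(threshold   = th,
    accuracy    = mean(ins_test$Response == pred),
    sensitivity = mean(pred[ins_test$Response == 1] == 1))
})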

Confusion Matrix

Y_hat_mod2 <- as.numeric(testing_ins > 0.25)
table(ins_test$Response, Y_hat_mod2, dnn = c("Actual", "Predicted"))
##       Predicted
## Actual     0     1
##      0 80122 20090
##      1  4688  9433

Checking the accuracy

accuracy <- mean(ins_test$Response == Y_hat_mod2)
accuracy
## [1] 0.7832822
  • The accuracy of the logistic regression model came out to be 78.33%.

To benchmark the logistic regression model against other classification algorithms, we build another classification model using Random Forest and then compare the two models in order to select the better one for our problem statement.

Random Forest

library(stats)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
ins_train$Response <- as.factor(ins_train$Response)
ins_test$Response <- as.factor(ins_test$Response)
set.seed(500)
RF_Model = randomForest(Response~.,data=ins_train,ntree=100)
predicted_response = predict(RF_Model,ins_test)

ins_test$predicted_response = predicted_response

Confusion Matrix for Random Forest

CFM1 = table(ins_test$Response,ins_test$predicted_response)
CFM1
##    
##          0      1
##   0 100077    135
##   1  13972    149

Accuracy for Random Forest

Accuracy_Random_Forest = sum(diag(CFM1)/sum(CFM1))
Accuracy_Random_Forest
## [1] 0.8766148
  • The Random Forest model accuracy came out to be 87.66%, which is higher than the 78.33% obtained from Logistic Regression.
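Because only about 12% of customers respond, overall accuracy can hide how poorly the positive class is detected; a quick sketch recovers the class-wise rates from the confusion matrix above:

# Sketch: class-wise rates of the Random Forest from its confusion matrix
RF_sensitivity <- CFM1["1", "1"] / sum(CFM1["1", ])  # true positives / actual positives
RF_specificity <- CFM1["0", "0"] / sum(CFM1["0", ])  # true negatives / actual negatives
RF_sensitivity
RF_specificity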

ROC and AUC Statistics

In the logistic regression above, we chose 0.25 as the probability threshold for classification by mere visual inspection. To choose the probability threshold in a more principled way, we additionally use the ROC curve and the AUC.

#ROC and AUC 

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
par(pty ="s")
auc_lm <- roc(ins_train$Response,ins_logreg_final$fitted.values, plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc.df <- data.frame(
  tpp = auc_lm$sensitivities*100,
  fpp = (1 - auc_lm$specificities)*100,
  thresholds = auc_lm$thresholds
)

tradeoff <- roc.df[roc.df$tpp >80 & roc.df$tpp <100, ]
head(tradeoff)
##             tpp       fpp thresholds
## 222975 99.72690 -9867.163  0.4063777
## 222976 99.72690 -9867.206  0.4063973
## 222977 99.72690 -9867.248  0.4064185
## 222978 99.72690 -9867.291  0.4064446
## 222979 99.72690 -9867.334  0.4064728
## 222980 99.42005 -9867.334  0.4064869
  • Studying the ROC curve, we look for a trade-off between the true positive rate and the false positive rate in the part of the curve where the true positive rate lies between 80 and 100.
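pROC can also suggest a threshold directly; note that with percent = TRUE the sensitivities and specificities stored in auc_lm are already percentages. A sketch using Youden's J statistic as an alternative to visual inspection:

# Sketch: threshold maximizing sensitivity + specificity (Youden's J)
coords(auc_lm, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))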
auc_rf <- roc(ins_train$Response,RF_Model$votes[,1], plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#4daf4a", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

roc(ins_train$Response,ins_logreg_final$fitted.values, plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = ins_train$Response, predictor = ins_logreg_final$fitted.values,     percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive %",     ylab = "True Positive %", col = "#377eb8", lwd = 4, print.auc = TRUE)
## 
## Data: ins_logreg_final$fitted.values in 234187 controls (ins_train$Response 0) < 32589 cases (ins_train$Response 1).
## Area under the curve: 83.67%
plot.roc(ins_train$Response,RF_Model$votes[,1], percent = TRUE, 
         col="#4daf4a", lwd=4, print.auc = TRUE, add=TRUE, print.auc.y=40)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

  • On inspecting the combined ROC/AUC plot, we observe that the AUC for Logistic Regression is higher than that for Random Forest, i.e., Logistic Regression outperformed Random Forest on the sample used for this classification problem. However, this does not mean that logistic regression is the optimal model for our problem, because our dataset is highly imbalanced.

Hence, we use SMOTE, which can overcome the challenge of class imbalance, and then reassess both models with ROC and AUC curves.

#SMOTE implementation

library('smotefamily')
library('DMwR')
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'DMwR'
## The following object is masked from 'package:smotefamily':
## 
##     SMOTE
ins_data$Response = as.factor(ins_data$Response)

ins_data$Gender <- ifelse(ins_data$Gender == "Male", 1, 0)
ins_data$Vehicle_Damage <- ifelse(ins_data$Vehicle_Damage == "Yes", 1, 0)
ins_data$Vehicle_Age <- ifelse(ins_data$Vehicle_Age == "> 2 Years", 2, ifelse(ins_data$Vehicle_Age == "1-2 Year", 1, 0))
set.seed(500)
trainSplit <- SMOTE(Response ~ ., ins_data, perc.over = 300, perc.under=200)
prop.table(table(trainSplit$Response))
## 
##   0   1 
## 0.6 0.4
  • SMOTE (Synthetic Minority Oversampling Technique) is applied to the dataset above to balance out the highly skewed proportions of 1s and 0s. The proportion of 1s was less than 15% in our dataset (as checked in the sketch below), so we up-sampled the number of 1s and down-sampled the number of 0s in order to obtain a balanced dataset on which classification can be performed.

  • After applying SMOTE, we achieved a 60:40 ratio of 0s to 1s.
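For reference, the pre-SMOTE class balance can be checked directly (a quick sketch; the data summary earlier already shows a mean Response of about 0.12):

# Sketch: class distribution of the original, imbalanced response
prop.table(table(ins_data$Response))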

Logistic regression model for balanced dataset

  • We now split the balanced dataset obtained by running SMOTE on the original data into training and test sets. These updated train and test sets are used to build the updated classification models.
set.seed(123)
train_idx <- sample(nrow(trainSplit), .70*nrow(trainSplit))

ins_train_sm <- trainSplit[train_idx,]
ins_test_sm <- trainSplit[-train_idx,]
ins_logreg_smote <- glm(Response ~ Gender+Age+Driving_License+Previously_Insured
                        +Vehicle_Age+Vehicle_Damage+Annual_Premium+Policy_Sales_Channel,
                        family=binomial,data = ins_train_sm)
summary(ins_logreg_smote)
## 
## Call:
## glm(formula = Response ~ Gender + Age + Driving_License + Previously_Insured + 
##     Vehicle_Age + Vehicle_Damage + Annual_Premium + Policy_Sales_Channel, 
##     family = binomial, data = ins_train_sm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0788  -0.6873  -0.0711   0.9250   3.5101  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.453e+00  1.201e-01 -20.418  < 2e-16 ***
## Gender                1.048e-01  9.124e-03  11.488  < 2e-16 ***
## Age                  -2.163e-02  4.526e-04 -47.786  < 2e-16 ***
## Driving_License       1.328e+00  1.153e-01  11.517  < 2e-16 ***
## Previously_Insured   -3.983e+00  4.959e-02 -80.310  < 2e-16 ***
## Vehicle_Age           7.652e-01  1.162e-02  65.824  < 2e-16 ***
## Vehicle_Damage        2.012e+00  2.206e-02  91.172  < 2e-16 ***
## Annual_Premium        1.952e-06  2.683e-07   7.274 3.49e-13 ***
## Policy_Sales_Channel -2.868e-03  9.185e-05 -31.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 440139  on 326969  degrees of freedom
## Residual deviance: 289710  on 326961  degrees of freedom
## AIC: 289728
## 
## Number of Fisher Scoring iterations: 8
  • To choose the probability threshold for classification, we generate predictions with type = "response", which returns predicted probabilities instead of class labels. We then plot a histogram of these probabilities to pick a threshold; as observed from the histogram below, we take 0.5 as the probability threshold.
testing_smote <- predict(ins_logreg_smote, ins_test_sm, type = "response")
hist(testing_smote)

  • Now we will create a confusion matrix to assess the accuracy of our updated logistic regression model.
Y_hat_mod2 <- as.numeric(testing_smote > 0.5)
ins_test_sm$predicted_values = Y_hat_mod2
table(ins_test_sm$Response, Y_hat_mod2, dnn = c("Actual", "Predicted"))
##       Predicted
## Actual     0     1
##      0 54439 29676
##      1  5387 50628
accuracy_sm <- mean(ins_test_sm$Response == Y_hat_mod2)
accuracy_sm
## [1] 0.7497823

Random Forest after applying SMOTE

set.seed(500)
RF_Model_sm = randomForest(Response~.,data=ins_train_sm, ntree=100)
predicted_response_sm = predict(RF_Model_sm,ins_test_sm)

ins_test_sm$predicted_response_sm = predicted_response_sm

Confusion Matrix for Random Forest

CFM2 = table(ins_test_sm$Response,ins_test_sm$predicted_response_sm)
CFM2
##    
##         0     1
##   0 72706 11409
##   1  6771 49244

Accuracy for Random Forest

Accuracy_Random_Forest_sm = sum(diag(CFM2)/sum(CFM2))
Accuracy_Random_Forest_sm
## [1] 0.8702633
  • To assess whether the updated models give better results, we plot the ROC curves and compare the area under each curve. As we can observe, the AUC for Random Forest is now greater than the AUC for Logistic Regression. Therefore, we conclude that the Random Forest model classifies our dataset more accurately.
auc_lm_sm <- roc(ins_train_sm$Response,ins_logreg_smote$fitted.values, plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

auc_rf_sm <- roc(ins_train_sm$Response,RF_Model_sm$votes[,1], plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#4daf4a", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

roc(ins_train_sm$Response,ins_logreg_smote$fitted.values, plot = TRUE, legacy.axes= TRUE, 
    percent = TRUE, xlab = "False Positive %", ylab = "True Positive %", 
    col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = ins_train_sm$Response, predictor = ins_logreg_smote$fitted.values,     percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive %",     ylab = "True Positive %", col = "#377eb8", lwd = 4, print.auc = TRUE)
## 
## Data: ins_logreg_smote$fitted.values in 196145 controls (ins_train_sm$Response 0) < 130825 cases (ins_train_sm$Response 1).
## Area under the curve: 83.28%
plot.roc(ins_train_sm$Response,RF_Model_sm$votes[,1], percent = TRUE,col="#4daf4a", lwd=4, print.auc = TRUE, add=TRUE, print.auc.y=40)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

  • We now export the results, i.e., the predicted values indicating whether each customer would take vehicle insurance, to an external file.
#Use function to write the file to a location
getwd()
## [1] "C:/Users/yamin/OneDrive/Desktop/Spring Sem 1 First half/BANA 6043 Statistical Computing/Final_Project_Group_2"
write_fn <- function(data,filename){
   write.csv(data,filename)
}

write_fn(ins_test_sm, "predicted_vehicle_insurance_data.csv")

Conclusion

  • The aim of our project was to classify whether a customer will take vehicle insurance or not. We used a dataset from Kaggle and performed exploratory data analysis for descriptive statistics.

  • For classification, we built two models, Logistic Regression and Random Forest, and calculated accuracy and AUC statistics. We observed that Logistic Regression outperformed Random Forest in terms of AUC. However, this result was driven by the fact that our dataset was highly imbalanced, with the proportion of 1s in the response variable below 15%.

  • To mitigate the impact of the imbalanced dataset, we used the SMOTE technique to achieve a 60:40 ratio in the response variable. We then rebuilt both classification models and assessed their accuracy. In this case, Random Forest outperformed Logistic Regression on both accuracy and AUC.

Model                 Metric     Before SMOTE (%)   After SMOTE (%)
Logistic Regression   Accuracy   78.33              74.98
Logistic Regression   AUC        83.67              83.28
Random Forest         Accuracy   87.66              87.03
Random Forest         AUC        82.4               94.97
  • The above table summarizes the model statistics before and after SMOTE for both classification models. Overall, we conclude that Random Forest is the better model for classifying our dataset.

Appendix

ROC and AUC for assessing classification model performance at different probability thresholds

ROC (Receiver operating characteristic curve)

An ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification model at all classification thresholds (the probability cut-offs at which the dependent variable is classified).

The ROC curve plots two parameters:

  • True Positive Rate
  • False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

\[TPR = \frac{TP}{TP + FN}\]

False Positive Rate (FPR) is defined as follows:

\[FPR = \frac{FP}{FP + TN}\]

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve -

AUC (Area under curve)

AUC measures separability: it tells how well the classification model can distinguish between classes. Generally, the higher the AUC, the better the model is at predicting 1s as 1s and 0s as 0s.

An excellent model has an AUC near 1, which means it has a good measure of separability; an AUC of 0.5 means the model has no separability at all, and an AUC near 0 means its predictions are essentially inverted. For example, when the AUC is 0.9, there is a 90% chance that the model will rank a randomly chosen positive example above a randomly chosen negative one.

Visually, the AUC is the area under the ROC curve. We can plot multiple ROC curves on the same axes to compare the performance of different models.

Implementation of ROC and AUC in R

To implement ROC and AUC in R, we can use the roc() function from the pROC package.

The roc() function returns the value of the AUC and plots the curve when the plot argument is TRUE. We can also print the AUC value on the curve itself (print.auc = TRUE) and overlay multiple models on a single plot to compare their performance.

Example of R code for ROC -

roc(dependent_variable, model_fitted_values, plot = TRUE)
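A minimal, self-contained illustration on simulated data (hypothetical labels and scores, not the project dataset):

library(pROC)

# Sketch: roc() on simulated labels and scores
set.seed(1)
y     <- rbinom(200, 1, 0.3)                              # binary outcome
score <- y * rnorm(200, mean = 1) + (1 - y) * rnorm(200)  # positives score higher on average
roc_obj <- roc(y, score, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE)
auc(roc_obj)                                              # area under the curve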

Decision Trees

Definition:

  • A decision tree is a predictive modelling tool constructed via an algorithmic approach that identifies ways to split a dataset based on different conditions. It is one of the most widely used and practical methods for supervised learning. Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Deciding nodes for Decision Trees:

  • To generate the best possible decision tree, the variables that act as nodes in the tree are chosen by calculating the information gain of each independent variable of the dataset.

  • To define information gain precisely, we first need a measure commonly used in information theory called entropy, which measures the level of impurity in a group of examples. Mathematically, it is defined as:

\[Entropy(S) = \sum_{i = 1}^{n} -p_i \log_2(p_i)\] where \(p_i\) is the probability of class \(i\).

  • The following figure shows the form of the entropy function relative to a boolean classification as \(p_+\) varies between 0 and 1.

  • Given entropy as a measure of the impurity in a sample of training examples, we can now define information gain as a measure of the effectiveness of an attribute in classifying the training data. The information gain, Gain(S, A), of an attribute A relative to a sample of examples S is defined as:

\[Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)\]

  • Here, Values(A) is the set of all possible values of the attribute A, and \(S_v\) is the subset of S for which attribute A has value v. Note that the first term in the equation is just the entropy of the original sample S, and the second term is the expected entropy after S is partitioned using attribute A, i.e. the entropy of its children. This expected entropy is simply the sum of the entropies of each subset \(S_v\), weighted by the fraction of examples \(\frac{|S_v|}{|S|}\) that belong to \(S_v\). Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.

In short: \[\text{Information Gain} = Entropy(\text{parent node}) - \text{Average } Entropy(\text{children})\]
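A small R sketch of these formulas (the example split is hypothetical and not tied to the project data):

# Sketch: entropy and information gain for a categorical split
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log2(p))
}

info_gain <- function(labels, attribute) {
  child_entropy <- tapply(labels, attribute, entropy)   # entropy of each child node
  child_weight  <- prop.table(table(attribute))         # fraction of examples in each child
  entropy(labels) - sum(child_weight * child_entropy)
}

# Hypothetical example: how well does a Yes/No attribute separate a binary label?
labels    <- c(1, 1, 1, 0, 0, 0, 0, 0)
attribute <- c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")
info_gain(labels, attribute)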

Random Forests

Definition:

  • Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble.

Algorithm

  • Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes the model’s prediction.

  • A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

  • The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction).

  • While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. So the prerequisites for random forest to perform well are:

    • There needs to be some actual signal in our features so that models built using those features do better than random guessing.
    • The predictions (and therefore the errors) made by the individual trees need to have low correlations with each other.

R code example

RFModel = randomForest(Response ~ ., data = dataset)

  • Here, the formula Response ~ . regresses the Response variable on all the other predictor variables in the dataset.

SMOTE for imbalanced classification

Definition:

  • SMOTE stands for Synthetic Minority Oversampling Technique.

Working

  • SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

  • This procedure can be used to create as many synthetic examples for the minority class as are required. The original SMOTE approach also suggests first using random undersampling to trim the number of examples in the majority class and then using SMOTE to oversample the minority class, balancing the class distribution.

  • Example of SMOTE in R:

Balanced_Data <- SMOTE(Response ~ ., trainSplit, perc.over = 100, perc.under = 100)
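To make the interpolation step described above concrete, here is a toy sketch of how a single synthetic point is formed (illustrative only, not the internals of the smotefamily or DMwR implementations):

# Sketch: one synthetic minority example as a convex combination of a point
# and one of its nearest minority-class neighbours
set.seed(42)
a <- c(Age = 25, Annual_Premium = 28000)    # a minority-class instance
b <- c(Age = 31, Annual_Premium = 30500)    # one of its k nearest minority neighbours
lambda    <- runif(1)                       # random weight in [0, 1]
synthetic <- a + lambda * (b - a)           # lies on the segment between a and b
synthetic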