An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer pays regularly to the insurance company in exchange for this guarantee.
Vehicle insurance works much like medical insurance: every year the customer pays a premium of a certain amount to the insurance provider so that, in the event of an accident involving the vehicle, the provider pays compensation (called the ‘sum assured’) to the customer.
For this project, we use a dataset from Kaggle: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction.
The data comes from an insurance company that has provided health insurance to its customers over the past year and is now interested in offering vehicle insurance to its policyholders.
| Variable | Definition |
|---|---|
| id | Unique ID for the customer |
| Gender | Gender of the customer |
| Age | Age of the customer |
| Driving_License | 0 : Customer does not have DL, 1 : Customer already has DL |
| Region_Code | Unique code for the region of the customer |
| Previously_Insured | 1 : Customer already has Vehicle Insurance, 0 : Customer doesn’t have Vehicle Insurance |
| Vehicle_Age | Age of the Vehicle |
| Vehicle_Damage | Yes : Customer got his/her vehicle damaged in the past, No : Customer didn’t get his/her vehicle damaged in the past |
| Annual_Premium | The amount the customer needs to pay as premium for the year |
| Policy_Sales_Channel | Anonymized code for the channel used to reach out to the customer, i.e. different agents, over mail, over phone, in person, etc. |
| Vintage | Number of days the customer has been associated with the company |
| Response | 1 : Customer is interested, 0 : Customer is not interested |
The objective is to predict whether an existing health insurance policyholder would also be interested in buying vehicle insurance. Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then plan its communication strategy to reach out to those customers and optimize its business model and revenue.
The aim of this project is to leverage machine learning algorithms such as Logistic Regression and Random Forest to create a predictive model using the statistically significant variables in the given data set.
Model performance will be assessed using techniques such as the ROC curve (Receiver Operating Characteristic), AUC (Area Under the ROC Curve) and the confusion matrix.
ins_data = read.csv("vehicle_insurance.csv")
head(ins_data)
## id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age
## 1 1 Male 44 1 28 0 > 2 Years
## 2 2 Male 76 1 3 0 1-2 Year
## 3 3 Male 47 1 28 0 > 2 Years
## 4 4 Male 21 1 11 1 < 1 Year
## 5 5 Female 29 1 41 1 < 1 Year
## 6 6 Female 24 1 33 0 < 1 Year
## Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
## 1 Yes 40454 26 217 1
## 2 No 33536 26 183 0
## 3 Yes 38294 26 27 1
## 4 No 28619 152 203 0
## 5 No 27496 152 39 0
## 6 Yes 2630 160 176 0
str(ins_data)
## 'data.frame': 381109 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age : int 44 76 47 21 29 24 23 56 24 32 ...
## $ Driving_License : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Region_Code : num 28 3 28 11 41 33 11 28 3 6 ...
## $ Previously_Insured : int 0 0 0 1 1 0 0 0 1 1 ...
## $ Vehicle_Age : chr "> 2 Years" "1-2 Year" "> 2 Years" "< 1 Year" ...
## $ Vehicle_Damage : chr "Yes" "No" "Yes" "No" ...
## $ Annual_Premium : num 40454 33536 38294 28619 27496 ...
## $ Policy_Sales_Channel: num 26 26 26 152 152 160 152 26 152 152 ...
## $ Vintage : int 217 183 27 203 39 176 249 72 28 80 ...
## $ Response : int 1 0 1 0 0 0 0 1 0 0 ...
summary(ins_data)
## id Gender Age Driving_License
## Min. : 1 Length:381109 Min. :20.00 Min. :0.0000
## 1st Qu.: 95278 Class :character 1st Qu.:25.00 1st Qu.:1.0000
## Median :190555 Mode :character Median :36.00 Median :1.0000
## Mean :190555 Mean :38.82 Mean :0.9979
## 3rd Qu.:285832 3rd Qu.:49.00 3rd Qu.:1.0000
## Max. :381109 Max. :85.00 Max. :1.0000
## Region_Code Previously_Insured Vehicle_Age Vehicle_Damage
## Min. : 0.00 Min. :0.0000 Length:381109 Length:381109
## 1st Qu.:15.00 1st Qu.:0.0000 Class :character Class :character
## Median :28.00 Median :0.0000 Mode :character Mode :character
## Mean :26.39 Mean :0.4582
## 3rd Qu.:35.00 3rd Qu.:1.0000
## Max. :52.00 Max. :1.0000
## Annual_Premium Policy_Sales_Channel Vintage Response
## Min. : 2630 Min. : 1 Min. : 10.0 Min. :0.0000
## 1st Qu.: 24405 1st Qu.: 29 1st Qu.: 82.0 1st Qu.:0.0000
## Median : 31669 Median :133 Median :154.0 Median :0.0000
## Mean : 30564 Mean :112 Mean :154.3 Mean :0.1226
## 3rd Qu.: 39400 3rd Qu.:152 3rd Qu.:227.0 3rd Qu.:0.0000
## Max. :540165 Max. :163 Max. :299.0 Max. :1.0000
sum(is.na(ins_data$Gender))
## [1] 0
sum(is.na(ins_data$Vehicle_Damage))
## [1] 0
sum(is.na(ins_data$Vehicle_Age))
## [1] 0
nrow(ins_data)
## [1] 381109
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
Count_Sales_Channel <- ins_data %>% group_by(Policy_Sales_Channel) %>% summarise(count = n()) %>% arrange(count)
Count_Sales_Channel
## # A tibble: 155 x 2
## Policy_Sales_Channel count
## <dbl> <int>
## 1 41 1
## 2 43 1
## 3 84 1
## 4 123 1
## 5 143 1
## 6 144 1
## 7 149 1
## 8 50 2
## 9 74 2
## 10 75 2
## # ... with 145 more rows
mean_AP <- ins_data %>% group_by(Vehicle_Damage) %>% summarise(mean = mean(Annual_Premium))
mean_AP
## # A tibble: 2 x 2
## Vehicle_Damage mean
## * <chr> <dbl>
## 1 No 30402.
## 2 Yes 30724.
library(ggthemes)
Age_Premium_Plot <- ggplot(data = ins_data) + geom_point(mapping = aes(x = Age, y = Annual_Premium)) + labs(x="Age",y="Annual Premium") + theme_wsj() + ggtitle("Age vs Annual Premium")
Age_Premium_Plot
ggplot(ins_data, aes(Age,fill=Vehicle_Damage))+ geom_bar(stat="count",position='dodge', width = 0.5)+ ggtitle("Age vs Vehicle Damage")
ggplot(ins_data, aes(Gender,fill=Vehicle_Damage))+ geom_bar(stat="count",position='dodge', width = 0.5) + labs(title="Gender vs Vehicle Damage") + theme_bw() + scale_fill_brewer()
ggplot(ins_data, mapping = aes(x = Response, y = Vehicle_Damage))+ theme_stata() + ggtitle("Response count vs Vehicle Damage") + geom_col()
ggplot(ins_data, aes(Response,fill=Vehicle_Age))+ geom_bar(stat="count",position='dodge', width = 0.5) +labs(title="Vehicle Age vs Response count") + theme_bw()+ scale_fill_brewer(palette="Set1")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
ins_data_copy <- sqldf("select * from ins_data")
# Control structures: use ifelse() to recode the character variables to numeric
ins_data_copy$Gender <- ifelse(ins_data_copy$Gender == "Male", 1, 0)
ins_data_copy$Vehicle_Damage <- ifelse(ins_data_copy$Vehicle_Damage == "Yes", 1, 0)
ins_data_copy$Vehicle_Age <- ifelse(ins_data_copy$Vehicle_Age == "> 2 Years", 2, ifelse(ins_data_copy$Vehicle_Age == "1-2 Year", 1, 0))
str(ins_data_copy)
## 'data.frame': 381109 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : num 1 1 1 1 0 0 1 0 0 0 ...
## $ Age : int 44 76 47 21 29 24 23 56 24 32 ...
## $ Driving_License : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Region_Code : num 28 3 28 11 41 33 11 28 3 6 ...
## $ Previously_Insured : int 0 0 0 1 1 0 0 0 1 1 ...
## $ Vehicle_Age : num 2 1 2 0 0 0 0 1 0 0 ...
## $ Vehicle_Damage : num 1 0 1 0 0 1 1 1 0 0 ...
## $ Annual_Premium : num 40454 33536 38294 28619 27496 ...
## $ Policy_Sales_Channel: num 26 26 26 152 152 160 152 26 152 152 ...
## $ Vintage : int 217 183 27 203 39 176 249 72 28 80 ...
## $ Response : int 1 0 1 0 0 0 0 1 0 0 ...
cor(ins_data_copy)
## id Gender Age Driving_License
## id 1.0000000000 0.0009991434 0.001561212 -0.0005635246
## Gender 0.0009991434 1.0000000000 0.145545034 -0.0183738198
## Age 0.0015612125 0.1455450343 1.000000000 -0.0797819773
## Driving_License -0.0005635246 -0.0183738198 -0.079781977 1.0000000000
## Region_Code -0.0005721027 0.0006041786 0.042573968 -0.0010808797
## Previously_Insured 0.0024570988 -0.0819321594 -0.254682388 0.0149694154
## Vehicle_Age 0.0008781725 0.1604271445 0.765790101 -0.0373057135
## Vehicle_Damage -0.0015074826 0.0916059179 0.267534315 -0.0166220166
## Annual_Premium 0.0030268195 0.0036727445 0.067507002 -0.0119064804
## Policy_Sales_Channel -0.0028370302 -0.1111588141 -0.577825526 0.0437305491
## Vintage -0.0006302473 -0.0025168966 -0.001264079 -0.0008480492
## Response -0.0013681502 0.0524399138 0.111146895 0.0101551746
## Region_Code Previously_Insured Vehicle_Age
## id -0.0005721027 0.002457099 0.0008781725
## Gender 0.0006041786 -0.081932159 0.1604271445
## Age 0.0425739676 -0.254682388 0.7657901011
## Driving_License -0.0010808797 0.014969415 -0.0373057135
## Region_Code 1.0000000000 -0.024658821 0.0437782890
## Previously_Insured -0.0246588212 1.000000000 -0.3808731921
## Vehicle_Age 0.0437782890 -0.380873192 1.0000000000
## Vehicle_Damage 0.0282350121 -0.824142665 0.3968729148
## Annual_Premium -0.0105875101 0.004268759 0.0415803272
## Policy_Sales_Channel -0.0424202424 0.219381069 -0.5506627965
## Vintage -0.0027496311 0.002536787 -0.0018922251
## Response 0.0105698556 -0.341170463 0.2218739872
## Vehicle_Damage Annual_Premium Policy_Sales_Channel
## id -0.001507483 0.0030268195 -2.837030e-03
## Gender 0.091605918 0.0036727445 -1.111588e-01
## Age 0.267534315 0.0675070016 -5.778255e-01
## Driving_License -0.016622017 -0.0119064804 4.373055e-02
## Region_Code 0.028235012 -0.0105875101 -4.242024e-02
## Previously_Insured -0.824142665 0.0042687594 2.193811e-01
## Vehicle_Age 0.396872915 0.0415803272 -5.506628e-01
## Vehicle_Damage 1.000000000 0.0093492910 -2.243769e-01
## Annual_Premium 0.009349291 1.0000000000 -1.132468e-01
## Policy_Sales_Channel -0.224376850 -0.1132468486 1.000000e+00
## Vintage -0.002064373 -0.0006084172 1.849942e-06
## Response 0.354399544 0.0225746955 -1.390415e-01
## Vintage Response
## id -6.302473e-04 -0.001368150
## Gender -2.516897e-03 0.052439914
## Age -1.264079e-03 0.111146895
## Driving_License -8.480492e-04 0.010155175
## Region_Code -2.749631e-03 0.010569856
## Previously_Insured 2.536787e-03 -0.341170463
## Vehicle_Age -1.892225e-03 0.221873987
## Vehicle_Damage -2.064373e-03 0.354399544
## Annual_Premium -6.084172e-04 0.022574696
## Policy_Sales_Channel 1.849942e-06 -0.139041501
## Vintage 1.000000e+00 -0.001050372
## Response -1.050372e-03 1.000000000
set.seed(123)
train_idx <- sample(nrow(ins_data), .70*nrow(ins_data))
ins_train <- ins_data[train_idx,]
ins_test <- ins_data[-train_idx,]
ins_logreg <- glm(Response ~.,family=binomial,data = ins_train)
summary(ins_logreg)
##
## Call:
## glm(formula = Response ~ ., family = binomial, data = ins_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3528 -0.6234 -0.0451 -0.0293 3.9813
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.153e+00 2.125e-01 -19.544 < 2e-16 ***
## id -6.881e-08 5.877e-08 -1.171 0.242
## GenderMale 8.999e-02 1.331e-02 6.763 1.35e-11 ***
## Age -2.555e-02 6.521e-04 -39.177 < 2e-16 ***
## Driving_License 1.321e+00 2.044e-01 6.462 1.03e-10 ***
## Region_Code 2.453e-04 5.212e-04 0.471 0.638
## Previously_Insured -4.004e+00 9.984e-02 -40.103 < 2e-16 ***
## Vehicle_Age> 2 Years 1.387e+00 3.213e-02 43.174 < 2e-16 ***
## Vehicle_Age1-2 Year 1.175e+00 2.247e-02 52.294 < 2e-16 ***
## Vehicle_DamageYes 2.016e+00 4.068e-02 49.568 < 2e-16 ***
## Annual_Premium 2.341e-06 3.542e-07 6.609 3.86e-11 ***
## Policy_Sales_Channel -2.415e-03 1.309e-04 -18.445 < 2e-16 ***
## Vintage -7.412e-05 7.735e-05 -0.958 0.338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198057 on 266775 degrees of freedom
## Residual deviance: 146383 on 266763 degrees of freedom
## AIC: 146409
##
## Number of Fisher Scoring iterations: 9
The p-values for id, Region_Code and Vintage are not statistically significant, so we drop these variables and refit the logistic regression.
ins_logreg_final <- glm(Response ~ Gender+Age+Driving_License+Previously_Insured
+Vehicle_Age+Vehicle_Damage+Annual_Premium+Policy_Sales_Channel,family=binomial,data = ins_train)
summary(ins_logreg_final)
##
## Call:
## glm(formula = Response ~ Gender + Age + Driving_License + Previously_Insured +
## Vehicle_Age + Vehicle_Damage + Annual_Premium + Policy_Sales_Channel,
## family = binomial, data = ins_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3479 -0.6231 -0.0451 -0.0294 3.9827
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.172e+00 2.114e-01 -19.733 < 2e-16 ***
## GenderMale 8.992e-02 1.331e-02 6.758 1.40e-11 ***
## Age -2.554e-02 6.521e-04 -39.173 < 2e-16 ***
## Driving_License 1.322e+00 2.044e-01 6.466 1.00e-10 ***
## Previously_Insured -4.004e+00 9.984e-02 -40.104 < 2e-16 ***
## Vehicle_Age> 2 Years 1.387e+00 3.213e-02 43.174 < 2e-16 ***
## Vehicle_Age1-2 Year 1.175e+00 2.247e-02 52.296 < 2e-16 ***
## Vehicle_DamageYes 2.016e+00 4.068e-02 49.572 < 2e-16 ***
## Annual_Premium 2.339e-06 3.542e-07 6.603 4.03e-11 ***
## Policy_Sales_Channel -2.416e-03 1.309e-04 -18.461 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 198057 on 266775 degrees of freedom
## Residual deviance: 146386 on 266766 degrees of freedom
## AIC: 146406
##
## Number of Fisher Scoring iterations: 9
testing_ins <- predict(ins_logreg_final, ins_test, type = "response")
hist(testing_ins)
Y_hat_mod2 <- as.numeric(testing_ins > 0.25)
table(ins_test$Response, Y_hat_mod2, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 80122 20090
## 1 4688 9433
accuracy <- mean(ins_test$Response == Y_hat_mod2)
accuracy
## [1] 0.7832822
To put the performance of our logistic regression model in context, we also build a second classification model using Random Forest and then compare the two models in order to select the better one for our problem statement.
library(stats)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
ins_train$Response <- as.factor(ins_train$Response)
ins_test$Response <- as.factor(ins_test$Response)
set.seed(500)
RF_Model = randomForest(Response~.,data=ins_train,ntree=100)
predicted_response = predict(RF_Model,ins_test)
ins_test$predicted_response = predicted_response
CFM1 = table(ins_test$Response,ins_test$predicted_response)
CFM1
##
## 0 1
## 0 100077 135
## 1 13972 149
Accuracy_Random_Forest = sum(diag(CFM1)) / sum(CFM1)
Accuracy_Random_Forest
## [1] 0.8766148
In the logistic regression above, we chose 0.25 as the probability threshold for classification by visual inspection alone. To choose the threshold in a more principled way, we also use ROC curves and AUC.
#ROC and AUC
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
par(pty ="s")
auc_lm <- roc(ins_train$Response,ins_logreg_final$fitted.values, plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc.df <- data.frame(
tpp = auc_lm$sensitivities*100,
fpp = (1 - auc_lm$specificities)*100,
thresholds = auc_lm$thresholds
)
tradeoff <- roc.df[roc.df$tpp >80 & roc.df$tpp <100, ]
head(tradeoff)
## tpp fpp thresholds
## 222975 99.72690 -9867.163 0.4063777
## 222976 99.72690 -9867.206 0.4063973
## 222977 99.72690 -9867.248 0.4064185
## 222978 99.72690 -9867.291 0.4064446
## 222979 99.72690 -9867.334 0.4064728
## 222980 99.42005 -9867.334 0.4064869
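Note that because percent = TRUE was used when creating the ROC object, auc_lm$sensitivities and auc_lm$specificities are already on a 0-100 scale; this is why the fpp column above comes out negative and the tpp filter does not actually isolate the 80-100% true-positive region. A minimal corrected sketch (an assumed fix with a new variable name, not the original code), together with pROC's coords() helper for suggesting a threshold:
# Sensitivities/specificities are already percentages because percent = TRUE
roc_df_pct <- data.frame(
  tpp        = auc_lm$sensitivities,        # true positive rate, 0-100
  fpp        = 100 - auc_lm$specificities,  # false positive rate, 0-100
  thresholds = auc_lm$thresholds
)
head(roc_df_pct[roc_df_pct$tpp > 80 & roc_df_pct$tpp < 100, ])
# Or let pROC suggest a threshold directly, e.g. by maximizing Youden's J
coords(auc_lm, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))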
auc_rf <- roc(ins_train$Response,RF_Model$votes[,1], plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#4daf4a", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc(ins_train$Response,ins_logreg_final$fitted.values, plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = ins_train$Response, predictor = ins_logreg_final$fitted.values, percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive %", ylab = "True Positive %", col = "#377eb8", lwd = 4, print.auc = TRUE)
##
## Data: ins_logreg_final$fitted.values in 234187 controls (ins_train$Response 0) < 32589 cases (ins_train$Response 1).
## Area under the curve: 83.67%
plot.roc(ins_train$Response,RF_Model$votes[,1], percent = TRUE,
col="#4daf4a", lwd=4, print.auc = TRUE, add=TRUE, print.auc.y=40)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
Hence, we use SMOTE, which can overcome the challenge of class imbalance, and then we assess the models again on accuracy and the ROC/AUC curve.
#SMOTE implementation
library('smotefamily')
library('DMwR')
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:smotefamily':
##
## SMOTE
ins_data$Response = as.factor(ins_data$Response)
ins_data$Gender <- ifelse(ins_data$Gender == "Male", 1, 0)
ins_data$Vehicle_Damage <- ifelse(ins_data$Vehicle_Damage == "Yes", 1, 0)
ins_data$Vehicle_Age <- ifelse(ins_data$Vehicle_Age == "> 2 Years", 2, ifelse(ins_data$Vehicle_Age == "1-2 Year", 1, 0))
set.seed(500)
trainSplit <- SMOTE(Response ~ ., ins_data, perc.over = 300, perc.under=200)
prop.table(table(trainSplit$Response))
##
## 0 1
## 0.6 0.4
SMOTE (Synthetic Minority Oversampling Technique) is applied above to balance out the highly skewed proportions of 1s and 0s. The proportion of 1s was less than 15% in our dataset, so we have up-sampled the number of 1s and down-sampled the number of 0s in order to obtain a more balanced dataset on which classification can be performed.
After applying SMOTE, we achieve a 60:40 ratio of 0s to 1s.
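The 60:40 split follows directly from the SMOTE parameters. A quick sanity check, assuming DMwR's parameterization (perc.over/100 synthetic cases are generated per minority case, and perc.under/100 majority cases are kept for each synthetic case created):
n_min <- sum(ins_data$Response == 1)   # original minority (Response = 1) count
n_syn <- n_min * 300 / 100             # perc.over = 300: 3 synthetic cases per minority case
n_maj <- n_syn * 200 / 100             # perc.under = 200: 2 majority cases kept per synthetic case
prop.table(c("0" = n_maj, "1" = n_min + n_syn))   # approximately 0.6 and 0.4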
set.seed(123)
train_idx <- sample(nrow(trainSplit), .70*nrow(trainSplit))
ins_train_sm <- trainSplit[train_idx,]
ins_test_sm <- trainSplit[-train_idx,]
ins_logreg_smote <- glm(Response ~ Gender+Age+Driving_License+Previously_Insured
+Vehicle_Age+Vehicle_Damage+Annual_Premium+Policy_Sales_Channel,
family=binomial,data = ins_train_sm)
summary(ins_logreg_smote)
##
## Call:
## glm(formula = Response ~ Gender + Age + Driving_License + Previously_Insured +
## Vehicle_Age + Vehicle_Damage + Annual_Premium + Policy_Sales_Channel,
## family = binomial, data = ins_train_sm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0788 -0.6873 -0.0711 0.9250 3.5101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.453e+00 1.201e-01 -20.418 < 2e-16 ***
## Gender 1.048e-01 9.124e-03 11.488 < 2e-16 ***
## Age -2.163e-02 4.526e-04 -47.786 < 2e-16 ***
## Driving_License 1.328e+00 1.153e-01 11.517 < 2e-16 ***
## Previously_Insured -3.983e+00 4.959e-02 -80.310 < 2e-16 ***
## Vehicle_Age 7.652e-01 1.162e-02 65.824 < 2e-16 ***
## Vehicle_Damage 2.012e+00 2.206e-02 91.172 < 2e-16 ***
## Annual_Premium 1.952e-06 2.683e-07 7.274 3.49e-13 ***
## Policy_Sales_Channel -2.868e-03 9.185e-05 -31.221 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 440139 on 326969 degrees of freedom
## Residual deviance: 289710 on 326961 degrees of freedom
## AIC: 289728
##
## Number of Fisher Scoring iterations: 8
testing_smote <- predict(ins_logreg_smote, ins_test_sm, type = "response")
hist(testing_smote)
Y_hat_mod2 <- as.numeric(testing_smote > 0.5)
ins_test_sm$predicted_values = Y_hat_mod2
table(ins_test_sm$Response, Y_hat_mod2, dnn = c("Actual", "Predicted"))
## Predicted
## Actual 0 1
## 0 54439 29676
## 1 5387 50628
accuracy_sm <- mean(ins_test_sm$Response == Y_hat_mod2)
accuracy_sm
## [1] 0.7497823
Random Forest after applying SMOTE
set.seed(500)
RF_Model_sm = randomForest(Response~.,data=ins_train_sm, ntree=100)
predicted_response_sm = predict(RF_Model_sm,ins_test_sm)
ins_test_sm$predicted_response_sm = predicted_response_sm
CFM2 = table(ins_test_sm$Response,ins_test_sm$predicted_response_sm)
CFM2
##
## 0 1
## 0 72706 11409
## 1 6771 49244
Accuracy_Random_Forest_sm = sum(diag(CFM2)) / sum(CFM2)
Accuracy_Random_Forest_sm
## [1] 0.8702633
auc_lm_sm <- roc(ins_train_sm$Response,ins_logreg_smote$fitted.values, plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf_sm <- roc(ins_train_sm$Response,RF_Model_sm$votes[,1], plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#4daf4a", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc(ins_train_sm$Response,ins_logreg_smote$fitted.values, plot = TRUE, legacy.axes= TRUE,
percent = TRUE, xlab = "False Positive %", ylab = "True Positive %",
col="#377eb8", lwd=4, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = ins_train_sm$Response, predictor = ins_logreg_smote$fitted.values, percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive %", ylab = "True Positive %", col = "#377eb8", lwd = 4, print.auc = TRUE)
##
## Data: ins_logreg_smote$fitted.values in 196145 controls (ins_train_sm$Response 0) < 130825 cases (ins_train_sm$Response 1).
## Area under the curve: 83.28%
plot.roc(ins_train_sm$Response,RF_Model_sm$votes[,1], percent = TRUE,col="#4daf4a", lwd=4, print.auc = TRUE, add=TRUE, print.auc.y=40)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
#Use function to write the file to a location
getwd()
## [1] "C:/Users/yamin/OneDrive/Desktop/Spring Sem 1 First half/BANA 6043 Statistical Computing/Final_Project_Group_2"
write_fn <- function(data,filename){
write.csv(data,filename)
}
write_fn(ins_test_sm, "predicted_vehicle_insurance_data.csv")
The aim of our project was to classify whether a customer will take vehicle insurance or not. We used a dataset from Kaggle and performed exploratory data analysis to obtain descriptive statistics.
For classification, we created two classification models, Logistic Regression and Random Forest, and calculated accuracy and AUC statistics. We observed that Logistic Regression outperformed Random Forest on AUC, although Random Forest had the higher raw accuracy. However, this comparison was affected by the fact that our dataset was highly imbalanced, with the proportion of 1s in the response variable below 15%.
To mitigate the impact of the imbalanced dataset, we used the SMOTE technique to achieve a 60:40 ratio in the response variable. We then rebuilt both classification models and assessed them again. In this case, Random Forest outperformed Logistic Regression on both accuracy and AUC.
| Models | Metric | Before SMOTE (%) | After SMOTE (%) |
|---|---|---|---|
| Logistic Regression | Accuracy | 78.33 | 74.98 |
| Logistic Regression | AUC | 83.67 | 83.28 |
| Random Forest | Accuracy | 87.66 | 87.03 |
| Random Forest | AUC | 82.4 | 94.97 |
An ROC curve (receiver operating characteristic curve) is a graph that shows the performance of a classification model at all classification thresholds (the probability cut-offs at which the dependent variable is classified).
ROC curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
\[TPR = \frac{TP}{TP + FN}\]
False Positive Rate (FPR) is defined as follows:
\[FPR = \frac{FP}{FP + TN}\]
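As a concrete illustration, TPR and FPR can be read directly off a confusion matrix; below is a quick sketch using the counts from the earlier logistic regression confusion matrix at the 0.25 threshold:
TP <- 9433;  FN <- 4688    # actual 1s predicted as 1 / predicted as 0
FP <- 20090; TN <- 80122   # actual 0s predicted as 1 / predicted as 0
TPR <- TP / (TP + FN)      # about 0.67
FPR <- FP / (FP + TN)      # about 0.20
c(TPR = TPR, FPR = FPR)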
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve -
AUC is a measure of separability: it tells how well the classification model is able to distinguish between classes. Generally, the higher the AUC, the better the model is at predicting 1s as 1s and 0s as 0s.
An excellent model has an AUC near 1, which means it separates the classes well. An AUC near 0.5 means the model has essentially no ability to separate the classes, and an AUC near 0 means the model is predicting the classes in reverse. For example, when the AUC is 0.9, there is a 90% chance that the model will rank a randomly chosen positive example above a randomly chosen negative one.
Visually, AUC is the area under the ROC curve. We can plot multiple ROC curves (with their AUC values) on the same axes to compare the performance of different models.
To implement ROC & AUC in R, we can use roc function in R which is present in the pROC library (part of pROC package).
roc function in R outputs the value of AUC as well and plots the curve in case plot attribute is equal to TRUE. We can also print the value of AUC on the curve itself and also plot multiple models on a single curve to compare the model performance.
Example of R code for ROC -
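A minimal sketch of such a call, mirroring the usage earlier in this report (here labels and scores are placeholder names for the observed 0/1 response and a model's predicted probabilities):
library(pROC)
# labels: observed 0/1 response; scores: predicted probabilities from a fitted model
roc_obj <- roc(labels, scores, percent = TRUE, plot = TRUE, legacy.axes = TRUE,
               xlab = "False Positive %", ylab = "True Positive %",
               col = "#377eb8", lwd = 4, print.auc = TRUE)
auc(roc_obj)   # area under the curve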
To generate the best possible decision trees, the variables that act as split nodes in the trees are chosen by calculating the information gain of each independent variable in the dataset.
To define information gain precisely, we first need a measure commonly used in information theory called entropy, which measures the level of impurity in a group of examples. Mathematically, it is defined as:
\[Entropy(S) = \sum_{i = 1}^{n} -p_i \log_2(p_i)\] where \(p_i\) is the probability of class \(i\). The following figure shows the form of the entropy function relative to a boolean classification as \(p_+\) varies between 0 and 1.
\[Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)\] where Values(A) is the set of all possible values of attribute A, and \(S_v\) is the subset of S for which attribute A has value v. Note that the first term in the equation is just the entropy of the original sample S, and the second term is the expected entropy after S is partitioned using attribute A, i.e. the entropy of its children. The expected entropy described by this second term is simply the sum of the entropies of each subset \(S_v\), weighted by the fraction of examples \(\frac{|S_v|}{|S|}\) that belong to \(S_v\). Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A.
In short: \[\text{Information Gain} = \text{Entropy(parent)} - \text{Weighted Average Entropy(children)}\]
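To make these formulas concrete, here is a small illustrative helper (hypothetical code, not part of the original analysis) that computes entropy and information gain in R:
# Entropy of a vector of class labels: sum over classes of -p_i * log2(p_i)
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                       # drop empty classes to avoid 0 * log2(0)
  -sum(p * log2(p))
}

# Information gain from splitting labels y on a categorical attribute a
info_gain <- function(y, a) {
  weights <- table(a) / length(a)                 # fraction |S_v| / |S| for each value v
  child_entropy <- sapply(split(y, a), entropy)   # Entropy(S_v) for each value v
  entropy(y) - sum(weights * child_entropy)
}

# Example: how informative is Vehicle_Damage about Response?
info_gain(ins_data$Response, ins_data$Vehicle_Damage)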
Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes the model’s prediction.
A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction).
While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. The prerequisites for random forest to perform well are therefore that (1) there is some actual signal in the features, so that the individual trees do better than random guessing, and (2) the predictions (and hence the errors) of the individual trees have low correlations with each other.
RFModel = randomForest(Response ~ ., data = dataset)
Here the ~ . regresses the Response variable on all the other predictor variables in the dataset.
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
This procedure can be used to create as many synthetic examples for the minority class as are required. The SMOTE approach also suggests first using random undersampling to trim the number of examples in the majority class, and then using SMOTE to oversample the minority class to balance the class distribution.
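The convex-combination step can be illustrated in a couple of lines of R (a toy sketch with made-up feature values, not the library's implementation):
# Two minority-class points in feature space (toy values)
a <- c(Age = 30, Annual_Premium = 28000)
b <- c(Age = 34, Annual_Premium = 31000)  # one of a's k nearest minority neighbours
gap <- runif(1)                           # random weight between 0 and 1
synthetic <- a + gap * (b - a)            # new synthetic point on the segment from a to b
synthetic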
Example of SMOTE in R:
Balanced_Data <- SMOTE(Response ~ ., trainSplit, perc.over = 100, perc.under = 100)