knitr::opts_chunk$set(
echo = TRUE,
error = FALSE,
fig.align = "center",
message = FALSE,
warning = FALSE,
results = "asis"
)
options(knitr.table.format = "html")#### library loads####
library(ggplot2)
library(dplyr)
library(stringi)
library(knitr)
library(kableExtra)
library(pander)
library(caret)
library(kernlab)
library(verification)
library(bestNormalize) # for boxcox
library(glmnet)
library(tidyverse)
library(dplyr)
library(readr)
library(caret)
library(AER)
library(lmtest) # lrtest(), waldtest()
library(nnet)
library(memisc)
library(stargazer)
library(tables)
library(broom)
library(tidyselect)
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
options(scipen=20)This dataset is about 9134 customers which have taken vehicle insurance. Source of this data is from Kaggle (https://www.kaggle.com/ranja7/vehicle-insurance-customer-data). The aim of this analysis to get know whether our insurance customers will extend their vehicle insurance based on their behaviour. In this paper, it supports with many libraries which are useful for this analysis such as dplyr, ggplot, caret, glm net and etc. In this dataset we will use some variables to create our models. Below is the explanation our variables
1. Customer - Customer ID, it is unique value
2. State - There are five location where customers live in states (Washington,Arizona, Nevada, California, Oregon)
3. Customer Lifetime Value - Value of customers insurance
4. Response - This will be our dependent variable. with categorical response “Yes” if the customers would like to renew their insurance and “No” if the customers would discontinue their insurance.
5. Coverage - There are 3 types of coverage insurances (Basic, Extended and Premium)
6. Education - Background education of customers (High School or Below, Bachelor, College, Master and Doctor)
7. Effective To Date - The first date when customer would like to actived their car insurance
8. Employment Status - Customer employemen status whether they are Employed, Unemployed, Medical Leave, Disabled, or Retired
9. Gender - F for Female and M for Male
10. Income - Customers income
11. Location Code - Where the customers live likes in Rural, Suburban, and Urban.
12. Marital Status - Customer marital status (Divorced, Married or Single)
13. Monthly Premium Auto - Premium auto that customers need to pay every month
14. Months Since Last Claim - Number of months since customers did last claim
15. Months Since Policy Inception - Number of months since customers did policy inception
16. Number of Open Complaints - Number of complaints
17. Number of Policies - Number of policies in when customers take part of car insurance
18. Policy Type - There are three type of policies in car insurance (Corporate Auto, Personal Auto, and Special Auto)
19. Policy - 3 variety of policies in insurance. There are three policies in each policy types (Corporate L3, Corporate L2, Corporate L1, Personal L3,Personal L2, Personal L1,Special L3, Special L2, Special L1)
20. Renew Offer Type - Each sales of Car Insurance offer 4 type of new insurances to customers. There are Offer 1, Offer 2, Offer 3 and Offer 4
21. Sales Channel - Each sales offer new car insurance by Agent, Call Center, Web and Branch
22. Total Claim Amount - Number of Total Claim Amount when customer did based on their coverage and other considerations.
23. Vehicle Class - Type of vehicle classes that customers have Two-Door Car, Four-Door Car SUV, Luxury SUV, Sports Car, and Luxury Car
24. Vehicle Size - Type of customers vehicle size, there are small, medium and large
head(insurance) %>%
knitr::kable(caption = "Insurance Dataset") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Customer | State | Customer.Lifetime.Value | Response | Coverage | Education | Effective.To.Date | EmploymentStatus | Gender | Income | Location.Code | Marital.Status | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Policy.Type | Policy | Renew.Offer.Type | Sales.Channel | Total.Claim.Amount | Vehicle.Class | Vehicle.Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | Suburban | Married | 69 | 32 | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
| QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | Suburban | Single | 94 | 13 | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
| AI49188 | Nevada | 12887.43165 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | Suburban | Married | 108 | 18 | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
| WW63253 | California | 7645.861827 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | Suburban | Married | 106 | 18 | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize |
| HB64268 | Washington | 2813.692575 | No | Basic | Bachelor | 3/2/2011 | Employed | M | 43836 | Rural | Single | 73 | 12 | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize |
| OC83172 | Oregon | 8256.2978 | Yes | Basic | Bachelor | 1/25/11 | Employed | F | 62902 | Rural | Married | 69 | 14 | 94 | 0 | 2 | Personal Auto | Personal L3 | Offer2 | Web | 159.383042 | Two-Door Car | Medsize |
Above shows our head of our data and as we can see below. In this part, we did for preparation data, we just need to convert some variables as nominal and ordinal data. We change some of our variables into nominal such as:
State
Response
Employment Status
Gender
Location Code
Marital Status
Policy Type
Sales Channel
Vehicle Class
Also we change some data into ordinal likes:
Coverage
Education
Vehicle Size
Other variables likes Customer Lifetime Value, Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Total Claim Amount are numeric.
# partition data
# 70% of training data
# 30% of testing data
set.seed(987654321)
insurance_which_train <- createDataPartition(insurance$Response, p=0.7, list = FALSE)
insurance_train <- insurance[insurance_which_train,]
insurance_test <- insurance[-insurance_which_train,]
# save our dataset in file insurance_dataset files, then we can have similar train and test dataset in the future
save(list = c("insurance_train",
"insurance_test"),
file = "insurance_dataset.RData")#### Check Missing Values ####
na_insurance<-colSums(is.na(insurance_train))
na_insurance %>%
knitr::kable(col.names="NA", caption = "Number of Missing Values") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| NA | |
|---|---|
| Customer | 0 |
| State | 0 |
| Customer.Lifetime.Value | 0 |
| Response | 0 |
| Coverage | 0 |
| Education | 0 |
| Effective.To.Date | 0 |
| EmploymentStatus | 0 |
| Gender | 0 |
| Income | 0 |
| Location.Code | 0 |
| Marital.Status | 0 |
| Monthly.Premium.Auto | 0 |
| Months.Since.Last.Claim | 0 |
| Months.Since.Policy.Inception | 0 |
| Number.of.Open.Complaints | 0 |
| Number.of.Policies | 0 |
| Policy.Type | 0 |
| Policy | 0 |
| Renew.Offer.Type | 0 |
| Sales.Channel | 0 |
| Total.Claim.Amount | 0 |
| Vehicle.Class | 0 |
| Vehicle.Size | 0 |
Based on table above, there are no missing values also in Vehicle insurance dataset. Therefore we don’t need to use any imputation in this dataset.
However, we split our dataset into train and test dataset with proportions 70% for train dataset and 30% for test dataset. There are 6395 for train dataset and 2739 for test dataset. We can see the train dataset in 1.3. Split Train and Test Dataset. We have saved the train and test dataset.
head(insurance_train) %>%
knitr::kable(caption = "Insurance Train Dataset") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Customer | State | Customer.Lifetime.Value | Response | Coverage | Education | Effective.To.Date | EmploymentStatus | Gender | Income | Location.Code | Marital.Status | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Policy.Type | Policy | Renew.Offer.Type | Sales.Channel | Total.Claim.Amount | Vehicle.Class | Vehicle.Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BU79786 | Washington | 2763.519 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | Suburban | Married | 69 | 32 | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.8111 | Two-Door Car | Medsize |
| QZ44356 | Arizona | 6979.536 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | Suburban | Single | 94 | 13 | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.4649 | Four-Door Car | Medsize |
| AI49188 | Nevada | 12887.432 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | Suburban | Married | 108 | 18 | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.4722 | Two-Door Car | Medsize |
| WW63253 | California | 7645.862 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | Suburban | Married | 106 | 18 | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.8813 | SUV | Medsize |
| HB64268 | Washington | 2813.693 | No | Basic | Bachelor | 3/2/2011 | Employed | M | 43836 | Rural | Single | 73 | 12 | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.1309 | Four-Door Car | Medsize |
| OC83172 | Oregon | 8256.298 | Yes | Basic | Bachelor | 1/25/11 | Employed | F | 62902 | Rural | Married | 69 | 14 | 94 | 0 | 2 | Personal Auto | Personal L3 | Offer2 | Web | 159.3830 | Two-Door Car | Medsize |
In this report, we need to do some visualization and to know our data further
ggplot(data=insurance_train, aes(x=Response))+
geom_bar( fill=c("#808080","#1E90FF")) +
theme_minimal() +
theme_classic() +
labs(title="Number of Response") * As we can see that our independent variable is not balance, we have people will no continue to use vehicle insurance about 5479 and people will not use vehicle insurance about 916. This can be our consideration to analysis this data with resampling or feature engineering to improve our model
ggplot(insurance_train, aes(x=Response, y=Total.Claim.Amount)) +
geom_boxplot(fill='#A4A4A4', color="darkblue") +
theme_minimal() +
labs(title = "Boxplot of Customers Decision") * Based on boxplot above, it shows that customers who have high Total Claim Amount decided not to renew their vehicle insurance but people who decide to renew their vehicle insurance mostly has got Total Claim Amount around 7.345946 and 1358.4
channel <-insurance_train %>%
group_by(Sales.Channel) %>%
summarise(n=n())
channel$perc <- channel$n/nrow(insurance_train)
pie = ggplot(channel, aes(x="", y=perc, fill=Sales.Channel)) + geom_bar(stat="identity")
# Convert to pie (polar coordinates) and add labels
pie + coord_polar("y", start=0)+
geom_text(aes(label = paste0(round(perc*100), "%")), position = position_stack(vjust = 0.5)) +
theme_classic() + theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5)) +
labs(x = NULL, y = NULL, fill = NULL, title = "Sales Channel Vehicle Insurance") new_offer <- insurance_train %>%
group_by(Gender, Renew.Offer.Type)%>%
summarise(n=n())
ggplot(new_offer, aes(x = Renew.Offer.Type, y = n))+
labs(x = NULL, y = NULL, fill = NULL, title = "Offering Renew Offer Type") +
geom_bar(
aes(fill = Renew.Offer.Type), stat = "identity", color = "white",
position = position_dodge(0.9)
)+
theme_classic() +
facet_wrap(~Gender) * Based on bar plot above it can be new strategy for the company, How to offer the Renew Vehicle insurance based on Gender. Nevetheless, more or less nothing different between Female and Male when they decide to renew their insurance. The difference would be in Offer type 4, it shows that Male is more interested to renew their vehicle insurance in new offer type 4 than female but we need to investigate further about this, if company would like to have new strategy based on gender
ggplot(data=insurance_train, aes(x=Response))+
geom_bar( fill="darkgreen") +
theme_minimal() +
labs(title="Response") +
coord_flip() *In this part, it would like to analyse using three different methods for categorical dependent variable using methods like,
1.) Logistic Regression
2.) KNN
3.) Ridge Regression and Lasso
In all analysis, we select “Response” as our dependent variable and we drop two variables likes “Customer” and “Effective to Date” since it did not useful in analysis.
In this first method, we compared two regressions, There are Logit Regression and Probit Regression. Below we can see our comparison table between Logit Regression and Probit Regression
c1 <- insurance_train_logit$coefficients
c2 <- insurance_train_probit$coefficients
c12 <- cbind(c1,c2)
c12 %>%
knitr::kable(col.names = c("Regression Logit", "Regression Probit"),caption = "Comparison Between Logit & Probit") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Regression Logit | Regression Probit | |
|---|---|---|
| (Intercept) | -2.1238872 | -1.2448839 |
| StateCalifornia | 0.1031821 | 0.0487939 |
| StateNevada | 0.0927564 | 0.0353556 |
| StateOregon | 0.1018565 | 0.0523793 |
| StateWashington | 0.0578533 | 0.0287346 |
| Customer.Lifetime.Value | -0.0000028 | -0.0000016 |
| CoverageExtended | -0.0474214 | -0.0230827 |
| CoveragePremium | 0.2626227 | 0.1391050 |
| EducationBachelor | 0.1333182 | 0.0677062 |
| EducationCollege | 0.3392410 | 0.1826816 |
| EducationMaster | 0.5230953 | 0.3298529 |
| EducationDoctor | 0.5803362 | 0.3169402 |
| EmploymentStatusEmployed | -0.2400799 | -0.1302629 |
| EmploymentStatusMedical Leave | 0.1347016 | 0.0755668 |
| EmploymentStatusRetired | 2.3285987 | 1.3790203 |
| EmploymentStatusUnemployed | -0.6263815 | -0.3418276 |
| GenderM | 0.0371003 | 0.0201790 |
| Income | 0.0000045 | 0.0000030 |
| Location.CodeSuburban | 1.3666499 | 0.7923865 |
| Location.CodeUrban | 0.1101430 | 0.0695286 |
| Marital.StatusMarried | -0.5858373 | -0.3227722 |
| Marital.StatusSingle | -0.6010885 | -0.3409333 |
| Monthly.Premium.Auto | 0.0003787 | 0.0001249 |
| Months.Since.Last.Claim | -0.0047451 | -0.0023458 |
| Months.Since.Policy.Inception | 0.0022964 | 0.0011636 |
| Number.of.Open.Complaints | -0.0143820 | -0.0016643 |
| Number.of.Policies | -0.0115261 | -0.0071925 |
| Policy.TypePersonal Auto | 0.0137366 | 0.0082449 |
| Policy.TypeSpecial Auto | -0.1999415 | -0.0837278 |
| PolicyPersonal L2 | -0.1288645 | -0.0770827 |
| PolicyPersonal L3 | -0.1720342 | -0.0930629 |
| PolicySpecial L1 | 0.1673612 | 0.0587742 |
| PolicySpecial L3 | 0.6399175 | 0.3355414 |
| PolicySpecial L2 | NA | NA |
| PolicyCorporate L1 | 0.0768275 | 0.0262502 |
| PolicyCorporate L2 | 0.1404268 | 0.0640978 |
| PolicyCorporate L3 | NA | NA |
| Renew.Offer.TypeOffer2 | 0.6814189 | 0.3839137 |
| Renew.Offer.TypeOffer3 | -2.2073937 | -1.0501764 |
| Renew.Offer.TypeOffer4 | -16.7608955 | -5.7247953 |
| Sales.ChannelBranch | -0.6435594 | -0.3754183 |
| Sales.ChannelCall Center | -0.5952724 | -0.3547292 |
| Sales.ChannelWeb | -0.6349736 | -0.3525787 |
| Total.Claim.Amount | -0.0012542 | -0.0007041 |
| Vehicle.ClassLuxury Car | -0.0995168 | -0.0001644 |
| Vehicle.ClassLuxury SUV | 0.7643857 | 0.3962873 |
| Vehicle.ClassSports Car | 0.7089762 | 0.3898242 |
| Vehicle.ClassSUV | 0.5048990 | 0.2826464 |
| Vehicle.ClassTwo-Door Car | 0.0146988 | 0.0089212 |
| Vehicle.SizeMedsize | 0.3736937 | 0.1914199 |
| Vehicle.SizeLarge | 0.7382753 | 0.3968874 |
if we put our significance value is 5% then we got our 15 significance variable with both methods, regression logit and regression probit. There are Education L, EmploymentStatusRetired, EmploymentStatusUnemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch, Sales.ChannelCall Center, Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car and Vehicle.Size.L.
But if we compare based on the AIC, we got the best method with probit, with very small different. In regression probit has AIC about 4201.4859018 but in regression logit it has AIC about 4207.3016481. But if we check based on model evaluation using their predicted value, you can check the result in table below
#### Comparing Balanced Accuracy Basic Model ####
summary_logit <- summary_binary(real=insurance_train$Response, predicted_probs = insurance_train_logit_fitted)
summary_probit <- summary_binary(real=insurance_train$Response, predicted_probs = insurance_train_probit_fitted)
summary_results<- cbind(summary_logit,summary_probit)
summary_results %>%
knitr::kable(col.names=c("Logit Regression","Probit Regression"), caption = "Comparing result") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Logit Regression | Probit Regression | |
|---|---|---|
| Accuracy | 0.86927 | 0.86943 |
| Sensitivity | 0.15721 | 0.15284 |
| Specificity | 0.98832 | 0.98923 |
| Pos Pred Value | 0.69231 | 0.70352 |
| Neg Pred Value | 0.87522 | 0.87476 |
| F1 | 0.25623 | 0.25112 |
| Balanced Accuracy | 0.57276 | 0.57104 |
Based on table above, There are no so significant difference in all parameters but since response variable is unbalanced then we need to pay attention in balanced accuracy. Both of them has similar balanced accuracy 57%. Then to get best method in logit regression, we can use feature engineering and resampling to increase the balanced accuracy. In feature engineering, we will use cross validation and in resampling methods, we will use down, up, rose and smote methods.
In this part, we’re going to improve our best model probit regression with feature engineering (cross validation, quantiles() and resampling (up, down, smote).
#### Best Method After Improving - Logit ####
models_all_insurance_probit <- ls(pattern ="insurance_train_probit2")
result2<-models_all_insurance_probit %>%
sapply(function(x) get(x) %>%
# use it for prediction
# in the TRAIN sample
predict(newdata = insurance_train) %>%
# apply the summary
summary_binary_class(level_positive = "Yes",
level_negative = "No",
real = insurance_train$Response))
summary_results2<- cbind(summary_results, result2)
summary_results2 %>%
knitr::kable(col.names=c("Logit Regression","Probit Regression", "insurance_train_probit2_cv", "insurance_train_probit2_down",
"insurance_train_probit2_quantiles", "insurance_train_probit2_smote", "insurance_train_probit2_up"),
caption = "Comparing result") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE) | Logit Regression | Probit Regression | insurance_train_probit2_cv | insurance_train_probit2_down | insurance_train_probit2_quantiles | insurance_train_probit2_smote | insurance_train_probit2_up | |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.86927 | 0.86943 | 0.86943 | 0.71040 | 0.86943 | 0.78170 | 0.70868 |
| Sensitivity | 0.15721 | 0.15284 | 0.15284 | 0.76201 | 0.15175 | 0.50873 | 0.77838 |
| Specificity | 0.98832 | 0.98923 | 0.98923 | 0.70177 | 0.98941 | 0.82734 | 0.69703 |
| Pos Pred Value | 0.69231 | 0.70352 | 0.70352 | 0.29931 | 0.70558 | 0.33003 | 0.30046 |
| Neg Pred Value | 0.87522 | 0.87476 | 0.87476 | 0.94635 | 0.87464 | 0.90969 | 0.94953 |
| F1 | 0.25623 | 0.25112 | 0.25112 | 0.42980 | 0.24978 | 0.40034 | 0.43357 |
| Balanced Accuracy | 0.57276 | 0.57104 | 0.57104 | 0.73189 | 0.57058 | 0.66804 | 0.73770 |
After we improved our model with resampling and feature engineering. Then it needs to merge all models together. We got the highest balanced accuracy using resampling up about 74%, also in this model sensitivity is the highest about 78%. Eventhough in this model, specificity is not the lowest about 69%. Then we will interpret the best model (with up sampling)
#### best method in logit regression ####
best_logit_method <- summary(insurance_train_probit2_up)
coef <- best_logit_method$coefficients
coef %>%
knitr::kable(caption = "Best Logit & Probit Regression Results") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -0.1218218 | 0.1815636 | -0.6709596 | 0.5022463 |
| StateCalifornia | 0.0735430 | 0.0402105 | 1.8289517 | 0.0674068 |
| StateNevada | 0.0558561 | 0.0549221 | 1.0170057 | 0.3091507 |
| StateOregon | 0.0782687 | 0.0417387 | 1.8752041 | 0.0607647 |
| StateWashington | 0.0255310 | 0.0580244 | 0.4400050 | 0.6599335 |
| Customer.Lifetime.Value | -0.0000020 | 0.0000022 | -0.8984364 | 0.3689530 |
| CoverageExtended | -0.0050130 | 0.0537187 | -0.0933189 | 0.9256502 |
| CoveragePremium | 0.2374842 | 0.1113627 | 2.1325300 | 0.0329633 |
| EducationBachelor | 0.0367610 | 0.0375603 | 0.9787192 | 0.3277188 |
| EducationCollege | 0.1854272 | 0.0374743 | 4.9481236 | 0.0000007 |
| EducationMaster | 0.4361578 | 0.0565841 | 7.7081343 | 0.0000000 |
| EducationDoctor | 0.3322843 | 0.0751924 | 4.4191224 | 0.0000099 |
| EmploymentStatusEmployed | -0.0838079 | 0.0739979 | -1.1325724 | 0.2573938 |
EmploymentStatusMedical Leave
|
0.1328668 | 0.0859182 | 1.5464346 | 0.1219996 |
| EmploymentStatusRetired | 1.3034502 | 0.1005039 | 12.9691511 | 0.0000000 |
| EmploymentStatusUnemployed | -0.3609602 | 0.0738425 | -4.8882476 | 0.0000010 |
| GenderM | 0.0509833 | 0.0282586 | 1.8041657 | 0.0712053 |
| Income | 0.0000036 | 0.0000008 | 4.3380746 | 0.0000144 |
| Location.CodeSuburban | 0.9369885 | 0.0579336 | 16.1734845 | 0.0000000 |
| Location.CodeUrban | 0.1049998 | 0.0545251 | 1.9257148 | 0.0541400 |
| Marital.StatusMarried | -0.3665222 | 0.0393725 | -9.3090948 | 0.0000000 |
| Marital.StatusSingle | -0.4122225 | 0.0444598 | -9.2717940 | 0.0000000 |
| Monthly.Premium.Auto | -0.0017740 | 0.0021833 | -0.8125352 | 0.4164846 |
| Months.Since.Last.Claim | -0.0014869 | 0.0014066 | -1.0570601 | 0.2904842 |
| Months.Since.Policy.Inception | 0.0009104 | 0.0005021 | 1.8130751 | 0.0698202 |
| Number.of.Open.Complaints | 0.0258598 | 0.0153149 | 1.6885337 | 0.0913088 |
| Number.of.Policies | -0.0075177 | 0.0058589 | -1.2831207 | 0.1994498 |
Policy.TypePersonal Auto
|
-0.0175437 | 0.0553116 | -0.3171799 | 0.7511071 |
Policy.TypeSpecial Auto
|
-0.0969069 | 0.1185942 | -0.8171300 | 0.4138541 |
PolicyPersonal L2
|
-0.0933147 | 0.0467180 | -1.9974033 | 0.0457814 |
PolicyPersonal L3
|
-0.0907951 | 0.0432279 | -2.1003816 | 0.0356953 |
PolicySpecial L1
|
-0.0307607 | 0.2072051 | -0.1484555 | 0.8819833 |
PolicySpecial L3
|
0.3028675 | 0.1473215 | 2.0558266 | 0.0397992 |
PolicyCorporate L1
|
-0.0226721 | 0.0827581 | -0.2739566 | 0.7841180 |
PolicyCorporate L2
|
0.0809377 | 0.0661026 | 1.2244245 | 0.2207921 |
| Renew.Offer.TypeOffer2 | 0.4404619 | 0.0323020 | 13.6357557 | 0.0000000 |
| Renew.Offer.TypeOffer3 | -1.2440738 | 0.0597121 | -20.8345313 | 0.0000000 |
| Renew.Offer.TypeOffer4 | -6.4781901 | 30.8175971 | -0.2102107 | 0.8335032 |
| Sales.ChannelBranch | -0.4435167 | 0.0351675 | -12.6115596 | 0.0000000 |
Sales.ChannelCall Center
|
-0.4645988 | 0.0407413 | -11.4036333 | 0.0000000 |
| Sales.ChannelWeb | -0.3508603 | 0.0437917 | -8.0120271 | 0.0000000 |
| Total.Claim.Amount | -0.0007860 | 0.0001083 | -7.2552908 | 0.0000000 |
Vehicle.ClassLuxury Car
|
0.0709991 | 0.3167809 | 0.2241268 | 0.8226587 |
Vehicle.ClassLuxury SUV
|
0.4879384 | 0.3074885 | 1.5868507 | 0.1125465 |
Vehicle.ClassSports Car
|
0.4595172 | 0.1101240 | 4.1727246 | 0.0000301 |
| Vehicle.ClassSUV | 0.4200763 | 0.0975002 | 4.3084675 | 0.0000164 |
Vehicle.ClassTwo-Door Car
|
0.0113326 | 0.0364641 | 0.3107863 | 0.7559631 |
| Vehicle.SizeMedsize | 0.1554925 | 0.0381214 | 4.0788744 | 0.0000453 |
| Vehicle.SizeLarge | 0.3782757 | 0.0535173 | 7.0682842 | 0.0000000 |
Based on table above, we got our significant variables (which have below 5%) likes Education College, Education Master, Education Doctor, EmploymentStatusRetired, Employment Status Unemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch , Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car, Vehicle.SizeLarge.
1. Education College the increase number of customers who have background in College will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below
2. Education Master the increase number of customers who have background in Master will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below
3. Education Doctor the increase number of customers who have background in Doctor will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below
4. EmploymentStatusRetired the increase number of customers who have employment status as Retired will increase the probability consumers to renew their insurance as compared to customer who have employment status as Disabled
5. EmploymentStatusUnemployed the increase number of customers who have employment status as Unemployed will increase the probability consumers to renew their insurance as compared to customer who have employment status as Disabled
6. Location.CodeSuburban the increase number of customers who live in Suburban will increase the probability consumers to renew their insurance
7. Marital.StatusMarried the increase number of customers who have marital status as Married will decrease the probability consumers to renew their insurance as compared to customer who have married status as Divorced
8. Marital.StatusSingle the increase number of customers who have marital status as Single will decrease the probability consumers to renew their insurance as compared to customer who have married status as Divorced
9.Renew.Offer.TypeOffer2 the increase number of customers who decide renew offer type offer 2 will increase the probability consumers to renew their insurance as compared to customer who decide renew offer type 1
10.Renew.Offer.TypeOffer3 the increase number of customers who decide renew offer type offer 3 will decrease the probability consumers to renew their insurance as compared to customer who decide renew offer type 1
11.Sales.ChannelBranch the increase number of customers who have bought vehicle insurance by Branch will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent
12.Sales.ChannelCall Center the increase number of customers who have bought vehicle insurance by Call Center will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent
13.Sales.ChannelWeb the increase number of customers who have bought vehicle insurance by Web will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent
14.Total.Claim.Amount the increase number of customers who have Total Claim Amount will decrease the probability consumers to renew their insurance
15.Vehicle.ClassSports Car the increase number of customers who have Sports Car increase the probability consumers to renew their insurance as compared to customer who have Four Door Car
16.Vehicle.SizeLarge. the increase number of customers who have large vehicles increase the probability consumers to renew their insurance as compared to customer who have small vehicles
In this KNN part, we build 5 models. first model is the basic KNN, second model is with tunegrid k=80, third model is with different k, fourth model and last model are with scaled
#### Summary Result KNN ####
summary_result_knn_test %>%
knitr::kable(caption = "Best Logit KNN Results based on test dataset") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| insurace_train_knn_5 | 0.84155 | 0.81378 | 0.84619 | 0.46912 | 0.96455 | 0.59515 | 0.82998 |
| insurance_train_knn_80 | 0.85688 | 0.00000 | 1.00000 | NaN | 0.85688 | NA | 0.50000 |
| insurance_train_knn_tunned | 0.97189 | 1.00000 | 0.96719 | 0.83582 | 1.00000 | 0.91057 | 0.98360 |
| insurance_train_knn_scaled | 0.90471 | 0.64796 | 0.94759 | 0.67374 | 0.94157 | 0.66060 | 0.79778 |
| insurance_train_knn_scaled2 | 0.90507 | 0.64286 | 0.94887 | 0.67742 | 0.94085 | 0.65969 | 0.79586 |
Based on prediction with test dataset, the highest balanced accuracy with model knn with tunned about 98% and also it has highest sensitivity since most predictions are “No” about 5479 predictions and and “Yes” only 916 observations. the highest specificity is model with knn with k different is 80, therefore it has 100% specificity and since our roc area is not quite good. But we can do resampling, since we have unbalanced response (100% data is negative).
#### Summary Result KNN ####
summary_result_knn_test %>%
knitr::kable(caption = "Best Logit KNN Results based on test dataset") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| insurace_train_knn_5 | 0.84155 | 0.81378 | 0.84619 | 0.46912 | 0.96455 | 0.59515 | 0.82998 |
| insurance_train_knn_80 | 0.85688 | 0.00000 | 1.00000 | NaN | 0.85688 | NA | 0.50000 |
| insurance_train_knn_tunned | 0.97189 | 1.00000 | 0.96719 | 0.83582 | 1.00000 | 0.91057 | 0.98360 |
| insurance_train_knn_scaled | 0.90471 | 0.64796 | 0.94759 | 0.67374 | 0.94157 | 0.66060 | 0.79778 |
| insurance_train_knn_scaled2 | 0.90507 | 0.64286 | 0.94887 | 0.67742 | 0.94085 | 0.65969 | 0.79586 |
However, we also tried to predict based on train dataset and the conclusion is the same that insurance_train_knn_tunned has the highest balanced accuracy but with train dataset is overfitting because the balanced accuracy is 100%. We would like to know what will happen if we still continue to improve our model with feature engineering and resampling.
#### Summary Result KNN based on train dataset####
summary_result_knn_train %>%
knitr::kable(caption = "Best Logit KNN Results based on train dataset") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| insurace_train_knn_5 | 0.90008 | 0.96834 | 0.88867 | 0.59252 | 0.99408 | 0.73518 | 0.92850 |
| insurance_train_knn_80 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| insurance_train_knn_tunned | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| insurance_train_knn_scaled | 0.94464 | 0.81987 | 0.96550 | 0.79894 | 0.96975 | 0.80927 | 0.89269 |
| insurance_train_knn_scaled2 | 0.94402 | 0.81878 | 0.96496 | 0.79618 | 0.96956 | 0.80732 | 0.89187 |
but before that, we will show our best model with KNN (based on test dataset). In this model, it also has high ROC area about 97% which mean the prediction of train dataset is very good.
#### Best Method in KNN - Scaled ####
insurance_forecasts_knn_scaled <- predict(insurance_train_knn_tunned_scaled,
insurance_train, type="prob")
confusionMatrix(data=as.factor(ifelse(insurance_forecasts_knn_scaled["Yes"]> 0.5,
"Yes","No")),
reference = insurance_train$Response,
positive="Yes")Confusion Matrix and Statistics
Reference
Prediction No Yes No 5306 169 Yes 173 747
Accuracy : 0.9465
95% CI : (0.9407, 0.9519)
No Information Rate : 0.8568
P-Value [Acc > NIR] : <0.0000000000000002
Kappa : 0.7825
Mcnemar’s Test P-Value : 0.8711
Sensitivity : 0.8155
Specificity : 0.9684
Pos Pred Value : 0.8120
Neg Pred Value : 0.9691
Prevalence : 0.1432
Detection Rate : 0.1168
Detection Prevalence : 0.1439
Balanced Accuracy : 0.8920
'Positive' Class : Yes
[1] 0.9732075
In this part, we’re going to improve our KNN with feature engineering (cross validation, quantiles() and resampling (up, down, smote).
#### Best Method After Improving - KNN ####
# this is based on train data
models_all_insurance_knn2 <- ls(pattern ="insurance_train_knn2")
summary_results_knn2<-models_all_insurance_knn2 %>%
sapply(function(x) get(x) %>%
predict(newdata = insurance_train) %>%
summary_binary_class(level_positive = "Yes",
level_negative = "No",
real = insurance_train$Response)) %>%
t()
summary_results2_knn<- rbind(summary_result_knn_train,summary_results_knn2)
summary_results2_knn %>%
knitr::kable(caption = "Best KNN Methods Results") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| insurace_train_knn_5 | 0.90008 | 0.96834 | 0.88867 | 0.59252 | 0.99408 | 0.73518 | 0.92850 |
| insurance_train_knn_80 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| insurance_train_knn_tunned | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| insurance_train_knn_scaled | 0.94464 | 0.81987 | 0.96550 | 0.79894 | 0.96975 | 0.80927 | 0.89269 |
| insurance_train_knn_scaled2 | 0.94402 | 0.81878 | 0.96496 | 0.79618 | 0.96956 | 0.80732 | 0.89187 |
| insurance_train_knn2_down | 0.85958 | 1.00000 | 0.83610 | 0.50496 | 1.00000 | 0.67106 | 0.91805 |
| insurance_train_knn2_tunned_cv | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| insurance_train_knn2_tunned_quantiles | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| insurance_train_knn2_up | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
Eventhough in our KNN Methods have high balanced accuracy but the the sensitivy was very high. Then the balanced accuracy is confirm that there is suspicious thing from this model, it might be because of overfitting. Model likes insurance_train_knn2_tunned_cv, insurance_train_knn2_tunned_quantiles, and insurance_train_knn2_up has balanced accuracy about 100%. Our best KNN model here when k=1, with the accuracy 95% and ROC about 96% which is very good model. Sensitivity and Specificity about 95% and 98% respectively.
In this ridge regression and lasso, we will analysis various type of ridge regression and lasso, we use two different parameters. For basic ridge and lasso regression we use parameter with 200 lambdas and second parameter is 10000 lambdas. However, we also include elastic parameter to get different result.
#### Comparing Ridge and Lasso Regression ####
#### comparing all ridge and lasso results ####
models_all_insurance_result_lambda <- ls(pattern ="result_lambda") # lasso & ridge
#models_all_insurance_result_lambda %>%
# sapply(function(x) get(x) %>%
# # use it for prediction
# # in the TRAIN sample
# predict(newdata = insurance_train) %>%
# # apply the summary
# summary_binary_class(level_positive = "Yes",
# level_negative = "No",
# real = insurance_train$Response))
models_all_insurance_result_lambda <- rbind(result_lambda_ridge,result_lambda_ridge2, result_lambda_logit,
result_lambda_lasso, result_lambda_elastic, result_lambda_elastic2)
models_all_insurance_result_lambda %>%
knitr::kable(caption = "Comparing result") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| result_lambda_ridge | 0.87224 | 0.14301 | 0.99416 | 0.80368 | 0.87404 | 0.24282 | 0.56859 |
| result_lambda_ridge2 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| result_lambda_logit | 0.87162 | 0.14629 | 0.99288 | 0.77457 | 0.87432 | 0.24610 | 0.56959 |
| result_lambda_lasso | 0.87162 | 0.14192 | 0.99361 | 0.78788 | 0.87384 | 0.24052 | 0.56777 |
| result_lambda_elastic | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| result_lambda_elastic2 | 0.87177 | 0.14192 | 0.99379 | 0.79268 | 0.87386 | 0.24074 | 0.56786 |
Based on that result all balanced accuracy is not quite good because most of balanced accuracies have similar value, around 50%-56%. but the highest balanced accuracy is in basic ridge regression with parameter lambda and cross validation. Then we need to improve the models to get higher balance accuracy with resampling and feature engineering.
In this part, we’re going to improve our Ridge Regression and Lasso with feature engineering (cross validation, quantiles() and resampling (up, down, smote).
#### Best Method After Improving - Ridge and Lasso ####
models_all_insurance_lambda2 <- ls(pattern ="insurance_train_lambda2")
summary_results_lambda2<-models_all_insurance_lambda2 %>%
sapply(function(x) get(x) %>%
predict(newdata = insurance_train) %>%
summary_binary_class(level_positive = "Yes",
level_negative = "No",
real = insurance_train$Response)) %>%
t()
summary_results2_lambda<- rbind(models_all_insurance_result_lambda,
summary_results_lambda2)
summary_results2_lambda %>%
knitr::kable(caption = "Best Ridge and Lasso Regression Methods Results") %>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)| Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy | |
|---|---|---|---|---|---|---|---|
| result_lambda_ridge | 0.87224 | 0.14301 | 0.99416 | 0.80368 | 0.87404 | 0.24282 | 0.56859 |
| result_lambda_ridge2 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| result_lambda_logit | 0.87162 | 0.14629 | 0.99288 | 0.77457 | 0.87432 | 0.24610 | 0.56959 |
| result_lambda_lasso | 0.87162 | 0.14192 | 0.99361 | 0.78788 | 0.87384 | 0.24052 | 0.56777 |
| result_lambda_elastic | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000 |
| result_lambda_elastic2 | 0.87177 | 0.14192 | 0.99379 | 0.79268 | 0.87386 | 0.24074 | 0.56786 |
| insurance_train_lambda2_cv | 0.87193 | 0.14192 | 0.99398 | 0.79755 | 0.87388 | 0.24096 | 0.56795 |
| insurance_train_lambda2_down | 0.70242 | 0.76638 | 0.69173 | 0.29360 | 0.94655 | 0.42455 | 0.72905 |
| insurance_train_lambda2_quantiles | 0.87193 | 0.14192 | 0.99398 | 0.79755 | 0.87388 | 0.24096 | 0.56795 |
| insurance_train_lambda2_smote | 0.85442 | 0.06114 | 0.98704 | 0.44094 | 0.86280 | 0.10738 | 0.52409 |
| insurance_train_lambda2_up | 0.70446 | 0.79258 | 0.68972 | 0.29926 | 0.95213 | 0.43447 | 0.74115 |
Based on table above, as we can see that the highest accuracy is model with up resampling about 74% and it has the highest sensitivity about 79%. but the highest specifity is model with resampling with smote. then we would analyse our best based model below
Our best Ridge and Lasso Regressions here is Lasso Regression (resampling up) with lambda = 0.0128989 with the accuracy is 70% and balanced accuracy is 74%. and our ROC is quite high about 80% which means that our model is good. Our sensitivity about 69% and specificity is 73%.Kappa 26.5%
From all methods, as we know each method increase the balanced accuracy after Resampling implemented since our dependent variable is not balance,
if we compare based on accuracies in all models, We can say KNN is the best model for this dataset since it has the highest balanced accuracy