knitr::opts_chunk$set(
    echo = TRUE,
    error = FALSE,
    fig.align = "center",
    message = FALSE,
    warning = FALSE,
    results = "asis"
)
options(knitr.table.format = "html")

#### library loads####
library(ggplot2)
library(dplyr)
library(stringi)
library(knitr)
library(kableExtra)
library(pander)
library(caret)
library(kernlab)
library(verification)
library(bestNormalize) # for boxcox
library(glmnet)
library(tidyverse)
library(dplyr)
library(readr)
library(caret)
library(AER)
library(lmtest) # lrtest(), waldtest()
library(nnet)
library(memisc)
library(stargazer)
library(tables)
library(broom)
library(tidyselect)

options(contrasts = c("contr.treatment",  # for non-ordinal factors 
                      "contr.treatment")) # for ordinal factors

options(scipen=20)

# Read insurance
insurance <- read.csv2("AutoInsurance.csv", sep=",", header=TRUE)

1. Description of data and problem analyzed

This dataset is about 9134 customers which have taken vehicle insurance. Source of this data is from Kaggle (https://www.kaggle.com/ranja7/vehicle-insurance-customer-data). The aim of this analysis to get know whether our insurance customers will extend their vehicle insurance based on their behaviour. In this paper, it supports with many libraries which are useful for this analysis such as dplyr, ggplot, caret, glm net and etc. In this dataset we will use some variables to create our models. Below is the explanation our variables

1.1 Data Description

1. Customer - Customer ID, it is unique value

2. State - There are five location where customers live in states (Washington,Arizona, Nevada, California, Oregon)

3. Customer Lifetime Value - Value of customers insurance

4. Response - This will be our dependent variable. with categorical response “Yes” if the customers would like to renew their insurance and “No” if the customers would discontinue their insurance.

5. Coverage - There are 3 types of coverage insurances (Basic, Extended and Premium)

6. Education - Background education of customers (High School or Below, Bachelor, College, Master and Doctor)

7. Effective To Date - The first date when customer would like to actived their car insurance

8. Employment Status - Customer employemen status whether they are Employed, Unemployed, Medical Leave, Disabled, or Retired

9. Gender - F for Female and M for Male

10. Income - Customers income

11. Location Code - Where the customers live likes in Rural, Suburban, and Urban.

12. Marital Status - Customer marital status (Divorced, Married or Single)

13. Monthly Premium Auto - Premium auto that customers need to pay every month

14. Months Since Last Claim - Number of months since customers did last claim

15. Months Since Policy Inception - Number of months since customers did policy inception

16. Number of Open Complaints - Number of complaints

17. Number of Policies - Number of policies in when customers take part of car insurance

18. Policy Type - There are three type of policies in car insurance (Corporate Auto, Personal Auto, and Special Auto)

19. Policy - 3 variety of policies in insurance. There are three policies in each policy types (Corporate L3, Corporate L2, Corporate L1, Personal L3,Personal L2, Personal L1,Special L3, Special L2, Special L1)

20. Renew Offer Type - Each sales of Car Insurance offer 4 type of new insurances to customers. There are Offer 1, Offer 2, Offer 3 and Offer 4

21. Sales Channel - Each sales offer new car insurance by Agent, Call Center, Web and Branch

22. Total Claim Amount - Number of Total Claim Amount when customer did based on their coverage and other considerations.

23. Vehicle Class - Type of vehicle classes that customers have Two-Door Car, Four-Door Car SUV, Luxury SUV, Sports Car, and Luxury Car

24. Vehicle Size - Type of customers vehicle size, there are small, medium and large

head(insurance) %>%
      knitr::kable(caption = "Insurance Dataset") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Insurance Dataset
Customer	State	Customer.Lifetime.Value	Response	Coverage	Education	Effective.To.Date	EmploymentStatus	Gender	Income	Location.Code	Marital.Status	Monthly.Premium.Auto	Months.Since.Last.Claim	Months.Since.Policy.Inception	Number.of.Policies	Policy.Type	Policy	Renew.Offer.Type	Sales.Channel	Total.Claim.Amount	Vehicle.Class	Vehicle.Size
BU79786	Washington	2763.519279	No	Basic	Bachelor	2/24/11	Employed	F	56274	Suburban	Married	69	32	5	1	Corporate Auto	Corporate L3	Offer1	Agent	384.811147	Two-Door Car	Medsize
QZ44356	Arizona	6979.535903	No	Extended	Bachelor	1/31/11	Unemployed	F	0	Suburban	Single	94	13	42	8	Personal Auto	Personal L3	Offer3	Agent	1131.464935	Four-Door Car	Medsize
AI49188	Nevada	12887.43165	No	Premium	Bachelor	2/19/11	Employed	F	48767	Suburban	Married	108	18	38	2	Personal Auto	Personal L3	Offer1	Agent	566.472247	Two-Door Car	Medsize
WW63253	California	7645.861827	No	Basic	Bachelor	1/20/11	Unemployed	M	0	Suburban	Married	106	18	65	7	Corporate Auto	Corporate L2	Offer1	Call Center	529.881344	SUV	Medsize
HB64268	Washington	2813.692575	No	Basic	Bachelor	3/2/2011	Employed	M	43836	Rural	Single	73	12	44	1	Personal Auto	Personal L1	Offer1	Agent	138.130879	Four-Door Car	Medsize
OC83172	Oregon	8256.2978	Yes	Basic	Bachelor	1/25/11	Employed	F	62902	Rural	Married	69	14	94	2	Personal Auto	Personal L3	Offer2	Web	159.383042	Two-Door Car	Medsize

1.2 Preparation Variables

Above shows our head of our data and as we can see below. In this part, we did for preparation data, we just need to convert some variables as nominal and ordinal data. We change some of our variables into nominal such as:

State
Response
Employment Status
Gender
Location Code
Marital Status
Policy Type
Sales Channel
Vehicle Class

Also we change some data into ordinal likes:

Coverage
Education
Vehicle Size

Other variables likes Customer Lifetime Value, Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Total Claim Amount are numeric.

# partition data
# 70% of training data
# 30% of testing data
set.seed(987654321)
insurance_which_train <- createDataPartition(insurance$Response, p=0.7, list = FALSE)

insurance_train <- insurance[insurance_which_train,]
insurance_test <- insurance[-insurance_which_train,]


# save our dataset in file insurance_dataset files, then we can have similar train and test dataset in the future
save(list = c("insurance_train",
              "insurance_test"),
     file = "insurance_dataset.RData")

1.2 Checking Missing Values

#### Check Missing Values ####
na_insurance<-colSums(is.na(insurance_train))

na_insurance %>%
      knitr::kable(col.names="NA", caption = "Number of Missing Values") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Number of Missing Values
	NA
Customer	0
State	0
Customer.Lifetime.Value	0
Response	0
Coverage	0
Education	0
Effective.To.Date	0
EmploymentStatus	0
Gender	0
Income	0
Location.Code	0
Marital.Status	0
Monthly.Premium.Auto	0
Months.Since.Last.Claim	0
Months.Since.Policy.Inception	0
Number.of.Open.Complaints	0
Number.of.Policies	0
Policy.Type	0
Policy	0
Renew.Offer.Type	0
Sales.Channel	0
Total.Claim.Amount	0
Vehicle.Class	0
Vehicle.Size	0

Based on table above, there are no missing values also in Vehicle insurance dataset. Therefore we don’t need to use any imputation in this dataset.

1.3 Split Train and Test Dataset

However, we split our dataset into train and test dataset with proportions 70% for train dataset and 30% for test dataset. There are 6395 for train dataset and 2739 for test dataset. We can see the train dataset in 1.3. Split Train and Test Dataset. We have saved the train and test dataset.

head(insurance_train) %>%
      knitr::kable(caption = "Insurance Train Dataset") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Insurance Train Dataset
Customer	State	Customer.Lifetime.Value	Response	Coverage	Education	Effective.To.Date	EmploymentStatus	Gender	Income	Location.Code	Marital.Status	Monthly.Premium.Auto	Months.Since.Last.Claim	Months.Since.Policy.Inception	Number.of.Policies	Policy.Type	Policy	Renew.Offer.Type	Sales.Channel	Total.Claim.Amount	Vehicle.Class	Vehicle.Size
BU79786	Washington	2763.519	No	Basic	Bachelor	2/24/11	Employed	F	56274	Suburban	Married	69	32	5	1	Corporate Auto	Corporate L3	Offer1	Agent	384.8111	Two-Door Car	Medsize
QZ44356	Arizona	6979.536	No	Extended	Bachelor	1/31/11	Unemployed	F	0	Suburban	Single	94	13	42	8	Personal Auto	Personal L3	Offer3	Agent	1131.4649	Four-Door Car	Medsize
AI49188	Nevada	12887.432	No	Premium	Bachelor	2/19/11	Employed	F	48767	Suburban	Married	108	18	38	2	Personal Auto	Personal L3	Offer1	Agent	566.4722	Two-Door Car	Medsize
WW63253	California	7645.862	No	Basic	Bachelor	1/20/11	Unemployed	M	0	Suburban	Married	106	18	65	7	Corporate Auto	Corporate L2	Offer1	Call Center	529.8813	SUV	Medsize
HB64268	Washington	2813.693	No	Basic	Bachelor	3/2/2011	Employed	M	43836	Rural	Single	73	12	44	1	Personal Auto	Personal L1	Offer1	Agent	138.1309	Four-Door Car	Medsize
OC83172	Oregon	8256.298	Yes	Basic	Bachelor	1/25/11	Employed	F	62902	Rural	Married	69	14	94	2	Personal Auto	Personal L3	Offer2	Web	159.3830	Two-Door Car	Medsize

2. initial descriptive analysis of the data

In this report, we need to do some visualization and to know our data further

ggplot(data=insurance_train, aes(x=Response))+
  geom_bar( fill=c("#808080","#1E90FF")) +
  theme_minimal() + 
  theme_classic() +
  labs(title="Number of Response")

* As we can see that our independent variable is not balance, we have people will no continue to use vehicle insurance about 5479 and people will not use vehicle insurance about 916. This can be our consideration to analysis this data with resampling or feature engineering to improve our model

ggplot(insurance_train, aes(x=Response, y=Total.Claim.Amount)) + 
      geom_boxplot(fill='#A4A4A4', color="darkblue") +
  theme_minimal() +
  labs(title = "Boxplot of Customers Decision")

* Based on boxplot above, it shows that customers who have high Total Claim Amount decided not to renew their vehicle insurance but people who decide to renew their vehicle insurance mostly has got Total Claim Amount around 7.345946 and 1358.4

channel <-insurance_train %>%
          group_by(Sales.Channel) %>%
          summarise(n=n())
channel$perc <- channel$n/nrow(insurance_train)         

pie = ggplot(channel, aes(x="", y=perc, fill=Sales.Channel)) + geom_bar(stat="identity")
 
# Convert to pie (polar coordinates) and add labels
pie + coord_polar("y", start=0)+ 
        geom_text(aes(label = paste0(round(perc*100), "%")), position = position_stack(vjust = 0.5)) + 
          theme_classic() + theme(axis.line = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          plot.title = element_text(hjust = 0.5)) +
          labs(x = NULL, y = NULL, fill = NULL, title = "Sales Channel Vehicle Insurance")

Based on pie plot above, Using Agent still the best method to sell vehicle insurance, it shows that there are 38.09% customers buy vehicle insurance also Branch is the second method that company can use to sell vehicle insurance because 27.8% customers come to branch office to buy vehicle insurance. Other two Sales Channel likes Call Center and Web do not have really significant different, about 18.97% and 15.14% people still consider to buy vehicle insurance buy those two channels respectively.

new_offer <- insurance_train %>%
              group_by(Gender, Renew.Offer.Type)%>%
              summarise(n=n())

ggplot(new_offer, aes(x = Renew.Offer.Type, y = n))+
  labs(x = NULL, y = NULL, fill = NULL, title = "Offering Renew Offer Type") +
  geom_bar(
    aes(fill = Renew.Offer.Type), stat = "identity", color = "white",
    position = position_dodge(0.9)
    )+
  theme_classic() +
  facet_wrap(~Gender)

* Based on bar plot above it can be new strategy for the company, How to offer the Renew Vehicle insurance based on Gender. Nevetheless, more or less nothing different between Female and Male when they decide to renew their insurance. The difference would be in Offer type 4, it shows that Male is more interested to renew their vehicle insurance in new offer type 4 than female but we need to investigate further about this, if company would like to have new strategy based on gender

ggplot(data=insurance_train, aes(x=Response))+
  geom_bar( fill="darkgreen") +
  theme_minimal() + 
  labs(title="Response") +
  coord_flip()

response_n <- insurance_train %>%
              group_by(Response) %>%
              summarise(n())

As we see plot above, our dependent/target variable is not balance. Our customers who want to renew the vehicle insurance about 916 and customers who don’t want to renew the vehicle insurance about Yes

3. Comparing Methods

*In this part, it would like to analyse using three different methods for categorical dependent variable using methods like,

1.) Logistic Regression

2.) KNN

3.) Ridge Regression and Lasso

In all analysis, we select “Response” as our dependent variable and we drop two variables likes “Customer” and “Effective to Date” since it did not useful in analysis.

3.1. Logit Regression

In this first method, we compared two regressions, There are Logit Regression and Probit Regression. Below we can see our comparison table between Logit Regression and Probit Regression

c1 <- insurance_train_logit$coefficients
c2 <- insurance_train_probit$coefficients

c12 <- cbind(c1,c2)

c12 %>%
      knitr::kable(col.names = c("Regression Logit", "Regression Probit"),caption = "Comparison Between Logit & Probit") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Comparison Between Logit & Probit
	Regression Logit	Regression Probit
(Intercept)	-2.1238872	-1.2448839
StateCalifornia	0.1031821	0.0487939
StateNevada	0.0927564	0.0353556
StateOregon	0.1018565	0.0523793
StateWashington	0.0578533	0.0287346
Customer.Lifetime.Value	-0.0000028	-0.0000016
CoverageExtended	-0.0474214	-0.0230827
CoveragePremium	0.2626227	0.1391050
EducationBachelor	0.1333182	0.0677062
EducationCollege	0.3392410	0.1826816
EducationMaster	0.5230953	0.3298529
EducationDoctor	0.5803362	0.3169402
EmploymentStatusEmployed	-0.2400799	-0.1302629
EmploymentStatusMedical Leave	0.1347016	0.0755668
EmploymentStatusRetired	2.3285987	1.3790203
EmploymentStatusUnemployed	-0.6263815	-0.3418276
GenderM	0.0371003	0.0201790
Income	0.0000045	0.0000030
Location.CodeSuburban	1.3666499	0.7923865
Location.CodeUrban	0.1101430	0.0695286
Marital.StatusMarried	-0.5858373	-0.3227722
Marital.StatusSingle	-0.6010885	-0.3409333
Monthly.Premium.Auto	0.0003787	0.0001249
Months.Since.Last.Claim	-0.0047451	-0.0023458
Months.Since.Policy.Inception	0.0022964	0.0011636
Number.of.Open.Complaints	-0.0143820	-0.0016643
Number.of.Policies	-0.0115261	-0.0071925
Policy.TypePersonal Auto	0.0137366	0.0082449
Policy.TypeSpecial Auto	-0.1999415	-0.0837278
PolicyPersonal L2	-0.1288645	-0.0770827
PolicyPersonal L3	-0.1720342	-0.0930629
PolicySpecial L1	0.1673612	0.0587742
PolicySpecial L3	0.6399175	0.3355414
PolicySpecial L2	NA	NA
PolicyCorporate L1	0.0768275	0.0262502
PolicyCorporate L2	0.1404268	0.0640978
PolicyCorporate L3	NA	NA
Renew.Offer.TypeOffer2	0.6814189	0.3839137
Renew.Offer.TypeOffer3	-2.2073937	-1.0501764
Renew.Offer.TypeOffer4	-16.7608955	-5.7247953
Sales.ChannelBranch	-0.6435594	-0.3754183
Sales.ChannelCall Center	-0.5952724	-0.3547292
Sales.ChannelWeb	-0.6349736	-0.3525787
Total.Claim.Amount	-0.0012542	-0.0007041
Vehicle.ClassLuxury Car	-0.0995168	-0.0001644
Vehicle.ClassLuxury SUV	0.7643857	0.3962873
Vehicle.ClassSports Car	0.7089762	0.3898242
Vehicle.ClassSUV	0.5048990	0.2826464
Vehicle.ClassTwo-Door Car	0.0146988	0.0089212
Vehicle.SizeMedsize	0.3736937	0.1914199
Vehicle.SizeLarge	0.7382753	0.3968874

if we put our significance value is 5% then we got our 15 significance variable with both methods, regression logit and regression probit. There are Education L, EmploymentStatusRetired, EmploymentStatusUnemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch, Sales.ChannelCall Center, Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car and Vehicle.Size.L.
But if we compare based on the AIC, we got the best method with probit, with very small different. In regression probit has AIC about 4201.4859018 but in regression logit it has AIC about 4207.3016481. But if we check based on model evaluation using their predicted value, you can check the result in table below

3.1.1 Comparing Balanced Accuracy Basic Model

#### Comparing Balanced Accuracy Basic Model ####
summary_logit <- summary_binary(real=insurance_train$Response, predicted_probs = insurance_train_logit_fitted)
summary_probit <- summary_binary(real=insurance_train$Response, predicted_probs = insurance_train_probit_fitted)
summary_results<- cbind(summary_logit,summary_probit)

summary_results %>%
      knitr::kable(col.names=c("Logit Regression","Probit Regression"), caption = "Comparing result") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Comparing result
	Logit Regression	Probit Regression
Accuracy	0.86927	0.86943
Sensitivity	0.15721	0.15284
Specificity	0.98832	0.98923
Pos Pred Value	0.69231	0.70352
Neg Pred Value	0.87522	0.87476
F1	0.25623	0.25112
Balanced Accuracy	0.57276	0.57104

Based on table above, There are no so significant difference in all parameters but since response variable is unbalanced then we need to pay attention in balanced accuracy. Both of them has similar balanced accuracy 57%. Then to get best method in logit regression, we can use feature engineering and resampling to increase the balanced accuracy. In feature engineering, we will use cross validation and in resampling methods, we will use down, up, rose and smote methods.

3.1.2 Improving probit regression

In this part, we’re going to improve our best model probit regression with feature engineering (cross validation, quantiles() and resampling (up, down, smote).

#### Best Method After Improving - Logit ####
models_all_insurance_probit <- ls(pattern ="insurance_train_probit2")

result2<-models_all_insurance_probit %>% 
  sapply(function(x) get(x) %>% 
           # use it for prediction
           # in the TRAIN sample
           predict(newdata = insurance_train) %>% 
           # apply the summary
           summary_binary_class(level_positive = "Yes",
                                level_negative = "No",
                                real = insurance_train$Response)) 

summary_results2<- cbind(summary_results, result2)

summary_results2 %>%
      knitr::kable(col.names=c("Logit Regression","Probit Regression", "insurance_train_probit2_cv", "insurance_train_probit2_down", 
                               "insurance_train_probit2_quantiles", "insurance_train_probit2_smote", "insurance_train_probit2_up"), 
                   caption = "Comparing result") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Comparing result
	Logit Regression	Probit Regression	insurance_train_probit2_cv	insurance_train_probit2_down	insurance_train_probit2_quantiles	insurance_train_probit2_smote	insurance_train_probit2_up
Accuracy	0.86927	0.86943	0.86943	0.71040	0.86943	0.78170	0.70868
Sensitivity	0.15721	0.15284	0.15284	0.76201	0.15175	0.50873	0.77838
Specificity	0.98832	0.98923	0.98923	0.70177	0.98941	0.82734	0.69703
Pos Pred Value	0.69231	0.70352	0.70352	0.29931	0.70558	0.33003	0.30046
Neg Pred Value	0.87522	0.87476	0.87476	0.94635	0.87464	0.90969	0.94953
F1	0.25623	0.25112	0.25112	0.42980	0.24978	0.40034	0.43357
Balanced Accuracy	0.57276	0.57104	0.57104	0.73189	0.57058	0.66804	0.73770

After we improved our model with resampling and feature engineering. Then it needs to merge all models together. We got the highest balanced accuracy using resampling up about 74%, also in this model sensitivity is the highest about 78%. Eventhough in this model, specificity is not the lowest about 69%. Then we will interpret the best model (with up sampling)

3.1.3 Best Method In Logit Regression

#### best method in logit regression ####
best_logit_method <- summary(insurance_train_probit2_up)
coef <- best_logit_method$coefficients

coef %>%
      knitr::kable(caption = "Best Logit & Probit Regression Results") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best Logit & Probit Regression Results
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-0.1218218	0.1815636	-0.6709596	0.5022463
StateCalifornia	0.0735430	0.0402105	1.8289517	0.0674068
StateNevada	0.0558561	0.0549221	1.0170057	0.3091507
StateOregon	0.0782687	0.0417387	1.8752041	0.0607647
StateWashington	0.0255310	0.0580244	0.4400050	0.6599335
Customer.Lifetime.Value	-0.0000020	0.0000022	-0.8984364	0.3689530
CoverageExtended	-0.0050130	0.0537187	-0.0933189	0.9256502
CoveragePremium	0.2374842	0.1113627	2.1325300	0.0329633
EducationBachelor	0.0367610	0.0375603	0.9787192	0.3277188
EducationCollege	0.1854272	0.0374743	4.9481236	0.0000007
EducationMaster	0.4361578	0.0565841	7.7081343	0.0000000
EducationDoctor	0.3322843	0.0751924	4.4191224	0.0000099
EmploymentStatusEmployed	-0.0838079	0.0739979	-1.1325724	0.2573938
`EmploymentStatusMedical Leave`	0.1328668	0.0859182	1.5464346	0.1219996
EmploymentStatusRetired	1.3034502	0.1005039	12.9691511	0.0000000
EmploymentStatusUnemployed	-0.3609602	0.0738425	-4.8882476	0.0000010
GenderM	0.0509833	0.0282586	1.8041657	0.0712053
Income	0.0000036	0.0000008	4.3380746	0.0000144
Location.CodeSuburban	0.9369885	0.0579336	16.1734845	0.0000000
Location.CodeUrban	0.1049998	0.0545251	1.9257148	0.0541400
Marital.StatusMarried	-0.3665222	0.0393725	-9.3090948	0.0000000
Marital.StatusSingle	-0.4122225	0.0444598	-9.2717940	0.0000000
Monthly.Premium.Auto	-0.0017740	0.0021833	-0.8125352	0.4164846
Months.Since.Last.Claim	-0.0014869	0.0014066	-1.0570601	0.2904842
Months.Since.Policy.Inception	0.0009104	0.0005021	1.8130751	0.0698202
Number.of.Open.Complaints	0.0258598	0.0153149	1.6885337	0.0913088
Number.of.Policies	-0.0075177	0.0058589	-1.2831207	0.1994498
`Policy.TypePersonal Auto`	-0.0175437	0.0553116	-0.3171799	0.7511071
`Policy.TypeSpecial Auto`	-0.0969069	0.1185942	-0.8171300	0.4138541
`PolicyPersonal L2`	-0.0933147	0.0467180	-1.9974033	0.0457814
`PolicyPersonal L3`	-0.0907951	0.0432279	-2.1003816	0.0356953
`PolicySpecial L1`	-0.0307607	0.2072051	-0.1484555	0.8819833
`PolicySpecial L3`	0.3028675	0.1473215	2.0558266	0.0397992
`PolicyCorporate L1`	-0.0226721	0.0827581	-0.2739566	0.7841180
`PolicyCorporate L2`	0.0809377	0.0661026	1.2244245	0.2207921
Renew.Offer.TypeOffer2	0.4404619	0.0323020	13.6357557	0.0000000
Renew.Offer.TypeOffer3	-1.2440738	0.0597121	-20.8345313	0.0000000
Renew.Offer.TypeOffer4	-6.4781901	30.8175971	-0.2102107	0.8335032
Sales.ChannelBranch	-0.4435167	0.0351675	-12.6115596	0.0000000
`Sales.ChannelCall Center`	-0.4645988	0.0407413	-11.4036333	0.0000000
Sales.ChannelWeb	-0.3508603	0.0437917	-8.0120271	0.0000000
Total.Claim.Amount	-0.0007860	0.0001083	-7.2552908	0.0000000
`Vehicle.ClassLuxury Car`	0.0709991	0.3167809	0.2241268	0.8226587
`Vehicle.ClassLuxury SUV`	0.4879384	0.3074885	1.5868507	0.1125465
`Vehicle.ClassSports Car`	0.4595172	0.1101240	4.1727246	0.0000301
Vehicle.ClassSUV	0.4200763	0.0975002	4.3084675	0.0000164
`Vehicle.ClassTwo-Door Car`	0.0113326	0.0364641	0.3107863	0.7559631
Vehicle.SizeMedsize	0.1554925	0.0381214	4.0788744	0.0000453
Vehicle.SizeLarge	0.3782757	0.0535173	7.0682842	0.0000000

Based on table above, we got our significant variables (which have below 5%) likes Education College, Education Master, Education Doctor, EmploymentStatusRetired, Employment Status Unemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch , Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car, Vehicle.SizeLarge.

1. Education College the increase number of customers who have background in College will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below

2. Education Master the increase number of customers who have background in Master will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below

3. Education Doctor the increase number of customers who have background in Doctor will increase the probability consumers to renew their insurance as compared to customer who have background in High School or below

4. EmploymentStatusRetired the increase number of customers who have employment status as Retired will increase the probability consumers to renew their insurance as compared to customer who have employment status as Disabled

5. EmploymentStatusUnemployed the increase number of customers who have employment status as Unemployed will increase the probability consumers to renew their insurance as compared to customer who have employment status as Disabled

6. Location.CodeSuburban the increase number of customers who live in Suburban will increase the probability consumers to renew their insurance

7. Marital.StatusMarried the increase number of customers who have marital status as Married will decrease the probability consumers to renew their insurance as compared to customer who have married status as Divorced

8. Marital.StatusSingle the increase number of customers who have marital status as Single will decrease the probability consumers to renew their insurance as compared to customer who have married status as Divorced

9.Renew.Offer.TypeOffer2 the increase number of customers who decide renew offer type offer 2 will increase the probability consumers to renew their insurance as compared to customer who decide renew offer type 1

10.Renew.Offer.TypeOffer3 the increase number of customers who decide renew offer type offer 3 will decrease the probability consumers to renew their insurance as compared to customer who decide renew offer type 1

11.Sales.ChannelBranch the increase number of customers who have bought vehicle insurance by Branch will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent

12.Sales.ChannelCall Center the increase number of customers who have bought vehicle insurance by Call Center will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent

13.Sales.ChannelWeb the increase number of customers who have bought vehicle insurance by Web will decrease the probability consumers to renew their insurance as compared to customer have bought vehicle insurance by Agent

14.Total.Claim.Amount the increase number of customers who have Total Claim Amount will decrease the probability consumers to renew their insurance

15.Vehicle.ClassSports Car the increase number of customers who have Sports Car increase the probability consumers to renew their insurance as compared to customer who have Four Door Car

16.Vehicle.SizeLarge. the increase number of customers who have large vehicles increase the probability consumers to renew their insurance as compared to customer who have small vehicles

3.2. KNN

In this KNN part, we build 5 models. first model is the basic KNN, second model is with tunegrid k=80, third model is with different k, fourth model and last model are with scaled

3.2.1 Summary Result KNN

#### Summary Result KNN ####
summary_result_knn_test %>%
  knitr::kable(caption = "Best Logit KNN Results based on test dataset") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best Logit KNN Results based on test dataset
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
insurace_train_knn_5	0.84155	0.81378	0.84619	0.46912	0.96455	0.59515	0.82998
insurance_train_knn_80	0.85688	0.00000	1.00000	NaN	0.85688	NA	0.50000
insurance_train_knn_tunned	0.97189	1.00000	0.96719	0.83582	1.00000	0.91057	0.98360
insurance_train_knn_scaled	0.90471	0.64796	0.94759	0.67374	0.94157	0.66060	0.79778
insurance_train_knn_scaled2	0.90507	0.64286	0.94887	0.67742	0.94085	0.65969	0.79586

Based on prediction with test dataset, the highest balanced accuracy with model knn with tunned about 98% and also it has highest sensitivity since most predictions are “No” about 5479 predictions and and “Yes” only 916 observations. the highest specificity is model with knn with k different is 80, therefore it has 100% specificity and since our roc area is not quite good. But we can do resampling, since we have unbalanced response (100% data is negative).

#### Summary Result KNN ####
summary_result_knn_test %>%
  knitr::kable(caption = "Best Logit KNN Results based on test dataset") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best Logit KNN Results based on test dataset
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
insurace_train_knn_5	0.84155	0.81378	0.84619	0.46912	0.96455	0.59515	0.82998
insurance_train_knn_80	0.85688	0.00000	1.00000	NaN	0.85688	NA	0.50000
insurance_train_knn_tunned	0.97189	1.00000	0.96719	0.83582	1.00000	0.91057	0.98360
insurance_train_knn_scaled	0.90471	0.64796	0.94759	0.67374	0.94157	0.66060	0.79778
insurance_train_knn_scaled2	0.90507	0.64286	0.94887	0.67742	0.94085	0.65969	0.79586

However, we also tried to predict based on train dataset and the conclusion is the same that insurance_train_knn_tunned has the highest balanced accuracy but with train dataset is overfitting because the balanced accuracy is 100%. We would like to know what will happen if we still continue to improve our model with feature engineering and resampling.

3.2.1 Summary Result KNN

#### Summary Result KNN based on train dataset####
summary_result_knn_train %>%
  knitr::kable(caption = "Best Logit KNN Results based on train dataset") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best Logit KNN Results based on train dataset
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
insurace_train_knn_5	0.90008	0.96834	0.88867	0.59252	0.99408	0.73518	0.92850
insurance_train_knn_80	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
insurance_train_knn_tunned	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000
insurance_train_knn_scaled	0.94464	0.81987	0.96550	0.79894	0.96975	0.80927	0.89269
insurance_train_knn_scaled2	0.94402	0.81878	0.96496	0.79618	0.96956	0.80732	0.89187

but before that, we will show our best model with KNN (based on test dataset). In this model, it also has high ROC area about 97% which mean the prediction of train dataset is very good.

#### Best Method in KNN - Scaled ####
insurance_forecasts_knn_scaled <- predict(insurance_train_knn_tunned_scaled,
                                          insurance_train, type="prob")

confusionMatrix(data=as.factor(ifelse(insurance_forecasts_knn_scaled["Yes"]> 0.5,
                                      "Yes","No")),
                reference = insurance_train$Response,
                positive="Yes")

Confusion Matrix and Statistics

   Reference

Prediction No Yes No 5306 169 Yes 173 747

        Accuracy : 0.9465             
             95% CI : (0.9407, 0.9519)   
No Information Rate : 0.8568             
P-Value [Acc > NIR] : <0.0000000000000002
                                         
              Kappa : 0.7825

Mcnemar’s Test P-Value : 0.8711

     Sensitivity : 0.8155             
        Specificity : 0.9684             
     Pos Pred Value : 0.8120             
     Neg Pred Value : 0.9691             
         Prevalence : 0.1432             
     Detection Rate : 0.1168

Detection Prevalence : 0.1439
Balanced Accuracy : 0.8920

'Positive' Class : Yes

roc.area(ifelse(insurance_train$Response == "Yes",1,0),
        insurance_forecasts_knn_scaled[,"Yes"])$A

[1] 0.9732075

3.2.2 Improving KNN

In this part, we’re going to improve our KNN with feature engineering (cross validation, quantiles() and resampling (up, down, smote).

3.2.3 The Best Model After Improving : KNN

#### Best Method After Improving - KNN ####
# this is based on train data
models_all_insurance_knn2 <- ls(pattern ="insurance_train_knn2")


summary_results_knn2<-models_all_insurance_knn2 %>% 
                  sapply(function(x) get(x) %>% 
                  predict(newdata = insurance_train) %>% 
                   summary_binary_class(level_positive = "Yes",
                                level_negative = "No",
                                real = insurance_train$Response)) %>%
                    t()


summary_results2_knn<- rbind(summary_result_knn_train,summary_results_knn2)

summary_results2_knn %>%
      knitr::kable(caption = "Best KNN Methods Results") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best KNN Methods Results
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
insurace_train_knn_5	0.90008	0.96834	0.88867	0.59252	0.99408	0.73518	0.92850
insurance_train_knn_80	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
insurance_train_knn_tunned	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000
insurance_train_knn_scaled	0.94464	0.81987	0.96550	0.79894	0.96975	0.80927	0.89269
insurance_train_knn_scaled2	0.94402	0.81878	0.96496	0.79618	0.96956	0.80732	0.89187
insurance_train_knn2_down	0.85958	1.00000	0.83610	0.50496	1.00000	0.67106	0.91805
insurance_train_knn2_tunned_cv	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000
insurance_train_knn2_tunned_quantiles	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000
insurance_train_knn2_up	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000	1.00000

Eventhough in our KNN Methods have high balanced accuracy but the the sensitivy was very high. Then the balanced accuracy is confirm that there is suspicious thing from this model, it might be because of overfitting. Model likes insurance_train_knn2_tunned_cv, insurance_train_knn2_tunned_quantiles, and insurance_train_knn2_up has balanced accuracy about 100%. Our best KNN model here when k=1, with the accuracy 95% and ROC about 96% which is very good model. Sensitivity and Specificity about 95% and 98% respectively.

3.3 Ridge Regression and Lasso

In this ridge regression and lasso, we will analysis various type of ridge regression and lasso, we use two different parameters. For basic ridge and lasso regression we use parameter with 200 lambdas and second parameter is 10000 lambdas. However, we also include elastic parameter to get different result.

3.3.1 Comparing Best Basic Model : Ridge Regression and Lasso

#### Comparing Ridge and Lasso Regression ####
#### comparing all ridge and lasso results ####
models_all_insurance_result_lambda <- ls(pattern ="result_lambda") # lasso & ridge

#models_all_insurance_result_lambda %>% 
#  sapply(function(x) get(x) %>% 
#           # use it for prediction
#           # in the TRAIN sample
#           predict(newdata = insurance_train) %>% 
#           # apply the summary
#          summary_binary_class(level_positive = "Yes",
#                                level_negative = "No",
#                                real = insurance_train$Response)) 

models_all_insurance_result_lambda <- rbind(result_lambda_ridge,result_lambda_ridge2, result_lambda_logit, 
                                            result_lambda_lasso, result_lambda_elastic, result_lambda_elastic2)

models_all_insurance_result_lambda %>%
  knitr::kable(caption = "Comparing result") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Comparing result
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
result_lambda_ridge	0.87224	0.14301	0.99416	0.80368	0.87404	0.24282	0.56859
result_lambda_ridge2	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
result_lambda_logit	0.87162	0.14629	0.99288	0.77457	0.87432	0.24610	0.56959
result_lambda_lasso	0.87162	0.14192	0.99361	0.78788	0.87384	0.24052	0.56777
result_lambda_elastic	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
result_lambda_elastic2	0.87177	0.14192	0.99379	0.79268	0.87386	0.24074	0.56786

Based on that result all balanced accuracy is not quite good because most of balanced accuracies have similar value, around 50%-56%. but the highest balanced accuracy is in basic ridge regression with parameter lambda and cross validation. Then we need to improve the models to get higher balance accuracy with resampling and feature engineering.

3.3.2 Improving Models : Ridge Regression and Lasso

In this part, we’re going to improve our Ridge Regression and Lasso with feature engineering (cross validation, quantiles() and resampling (up, down, smote).

3.3.3 Comparing Best Method After Improving : Ridge Regression and Lasso

#### Best Method After Improving - Ridge and Lasso ####
models_all_insurance_lambda2 <- ls(pattern ="insurance_train_lambda2")


summary_results_lambda2<-models_all_insurance_lambda2 %>% 
                  sapply(function(x) get(x) %>% 
                  predict(newdata = insurance_train) %>% 
                   summary_binary_class(level_positive = "Yes",
                                level_negative = "No",
                                real = insurance_train$Response)) %>%
                    t()


summary_results2_lambda<- rbind(models_all_insurance_result_lambda, 
                         summary_results_lambda2)

summary_results2_lambda %>%
      knitr::kable(caption = "Best Ridge and Lasso Regression Methods Results") %>%
      kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Best Ridge and Lasso Regression Methods Results
	Accuracy	Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	F1	Balanced Accuracy
result_lambda_ridge	0.87224	0.14301	0.99416	0.80368	0.87404	0.24282	0.56859
result_lambda_ridge2	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
result_lambda_logit	0.87162	0.14629	0.99288	0.77457	0.87432	0.24610	0.56959
result_lambda_lasso	0.87162	0.14192	0.99361	0.78788	0.87384	0.24052	0.56777
result_lambda_elastic	0.85676	0.00000	1.00000	NaN	0.85676	NA	0.50000
result_lambda_elastic2	0.87177	0.14192	0.99379	0.79268	0.87386	0.24074	0.56786
insurance_train_lambda2_cv	0.87193	0.14192	0.99398	0.79755	0.87388	0.24096	0.56795
insurance_train_lambda2_down	0.70242	0.76638	0.69173	0.29360	0.94655	0.42455	0.72905
insurance_train_lambda2_quantiles	0.87193	0.14192	0.99398	0.79755	0.87388	0.24096	0.56795
insurance_train_lambda2_smote	0.85442	0.06114	0.98704	0.44094	0.86280	0.10738	0.52409
insurance_train_lambda2_up	0.70446	0.79258	0.68972	0.29926	0.95213	0.43447	0.74115

Based on table above, as we can see that the highest accuracy is model with up resampling about 74% and it has the highest sensitivity about 79%. but the highest specifity is model with resampling with smote. then we would analyse our best based model below

3.3.4 The Best Model : Ridge Regression and Lasso

Our best Ridge and Lasso Regressions here is Lasso Regression (resampling up) with lambda = 0.0128989 with the accuracy is 70% and balanced accuracy is 74%. and our ROC is quite high about 80% which means that our model is good. Our sensitivity about 69% and specificity is 73%.Kappa 26.5%

4. Summary and Conclusions

From all methods, as we know each method increase the balanced accuracy after Resampling implemented since our dependent variable is not balance,

1. In Logit Regression this is the best model is (insurance_train_probit2_up) with balanced accuracy 74%.
1. In KNN, (insurance_train_knn2_tunned_cv) we have better balanced accuracy than in logit regression is about 95% and ROC also very high about 96% (it means that our model is very good
1. In Ridge and Lasso Regression, our balanced accuracy is lower than KNN about 70%

if we compare based on accuracies in all models, We can say KNN is the best model for this dataset since it has the highest balanced accuracy

Machine Learning Project - Classification Project

Adinda Herawati & Mert Camur

1. Description of data and problem analyzed

1.1 Data Description

1.2 Preparation Variables

1.2 Checking Missing Values

1.3 Split Train and Test Dataset

2. initial descriptive analysis of the data

3. Comparing Methods

3.1. Logit Regression

3.1.1 Comparing Balanced Accuracy Basic Model

3.1.2 Improving probit regression

3.1.3 Best Method In Logit Regression

3.2. KNN

3.2.1 Summary Result KNN

3.2.1 Summary Result KNN

3.2.2 Improving KNN

3.2.3 The Best Model After Improving : KNN

3.3 Ridge Regression and Lasso

3.3.1 Comparing Best Basic Model : Ridge Regression and Lasso

3.3.2 Improving Models : Ridge Regression and Lasso

3.3.3 Comparing Best Method After Improving : Ridge Regression and Lasso

3.3.4 The Best Model : Ridge Regression and Lasso

4. Summary and Conclusions