1. Description of data and problem analyzed

This dataset describes 9134 customers who hold vehicle insurance. The data come from Kaggle (https://www.kaggle.com/ranja7/vehicle-insurance-customer-data). The aim of this analysis is to find out whether our insurance customers will renew their vehicle insurance, based on their behaviour. The analysis relies on several libraries, such as dplyr, ggplot, caret and glmnet. We use a subset of the variables in this dataset to build our models. The variables are explained below.

1.1 Data Description

1. Customer - Customer ID, a unique value

2. State - The state where the customer lives; there are five (Washington, Arizona, Nevada, California, Oregon)

3. Customer Lifetime Value - Value of the customer's insurance

4. Response - Our dependent variable. A categorical response: “Yes” if the customer would like to renew their insurance and “No” if the customer would discontinue it.

5. Coverage - There are 3 types of insurance coverage (Basic, Extended and Premium)

6. Education - Educational background of the customer (High School or Below, Bachelor, College, Master and Doctor)

7. Effective To Date - The first date on which the customer would like their car insurance to be active

8. Employment Status - Customer employment status: Employed, Unemployed, Medical Leave, Disabled, or Retired

9. Gender - F for Female and M for Male

10. Income - Customer's income

11. Location Code - The type of area where the customer lives: Rural, Suburban, or Urban

12. Marital Status - Customer marital status (Divorced, Married or Single)

13. Monthly Premium Auto - The auto premium that the customer pays every month

14. Months Since Last Claim - Number of months since the customer's last claim

15. Months Since Policy Inception - Number of months since the customer's policy inception

16. Number of Open Complaints - Number of open complaints

17. Number of Policies - Number of policies the customer holds with the car insurance

18. Policy Type - There are three types of policy in car insurance (Corporate Auto, Personal Auto, and Special Auto)

19. Policy - There are three policy levels within each policy type (Corporate L3, Corporate L2, Corporate L1, Personal L3, Personal L2, Personal L1, Special L3, Special L2, Special L1)

20. Renew Offer Type - Sales staff offer customers 4 types of renewal offer: Offer 1, Offer 2, Offer 3 and Offer 4

21. Sales Channel - The channel through which the insurance is offered: Agent, Call Center, Web or Branch

22. Total Claim Amount - Total amount claimed by the customer, based on their coverage and other considerations

23. Vehicle Class - The class of the customer's vehicle: Two-Door Car, Four-Door Car, SUV, Luxury SUV, Sports Car, and Luxury Car

24. Vehicle Size - The size of the customer's vehicle: small, medium or large

Insurance Dataset
Customer | State | Customer.Lifetime.Value | Response | Coverage | Education | Effective.To.Date | EmploymentStatus | Gender | Income | Location.Code | Marital.Status | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Policy.Type | Policy | Renew.Offer.Type | Sales.Channel | Total.Claim.Amount | Vehicle.Class | Vehicle.Size
BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | Suburban | Married | 69 | 32 | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize
QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | Suburban | Single | 94 | 13 | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize
AI49188 | Nevada | 12887.43165 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | Suburban | Married | 108 | 18 | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize
WW63253 | California | 7645.861827 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | Suburban | Married | 106 | 18 | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize
HB64268 | Washington | 2813.692575 | No | Basic | Bachelor | 3/2/2011 | Employed | M | 43836 | Rural | Single | 73 | 12 | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize
OC83172 | Oregon | 8256.2978 | Yes | Basic | Bachelor | 1/25/11 | Employed | F | 62902 | Rural | Married | 69 | 14 | 94 | 0 | 2 | Personal Auto | Personal L3 | Offer2 | Web | 159.383042 | Two-Door Car | Medsize

1.2 Preparation Variables

The table above shows the first rows of our data. To prepare the data, we only need to convert some variables to nominal or ordinal type. We convert the following variables to nominal:

  • State

  • Response

  • Employment Status

  • Gender

  • Location Code

  • Marital Status

  • Policy Type

  • Sales Channel

  • Vehicle Class

We also convert some variables to ordinal:

  • Coverage

  • Education

  • Vehicle Size

The other variables, namely Customer Lifetime Value, Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies and Total Claim Amount, are numeric.
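The distinction between nominal and ordinal variables can be illustrated with a minimal Python sketch (this is not the R code used for this report). The level order for Coverage and Vehicle Size follows the lists above; the level name "Small", the integer ranks and the helper name `encode_row` are assumptions made for illustration.

```python
# Ordered levels for two of the ordinal variables. The order follows the
# report; mapping the levels to ranks 0, 1, 2 is an illustrative assumption.
ORDINAL_LEVELS = {
    "Coverage": ["Basic", "Extended", "Premium"],
    "Vehicle.Size": ["Small", "Medsize", "Large"],
}


def encode_row(row):
    """Return a copy of row with ordinal columns mapped to integer ranks.

    Nominal columns are left as labels: their categories have no order,
    so replacing them with ranks would invent one.
    """
    encoded = dict(row)
    for column, levels in ORDINAL_LEVELS.items():
        encoded[column] = levels.index(row[column])
    return encoded


row = {"State": "Washington", "Gender": "F",
       "Coverage": "Extended", "Vehicle.Size": "Medsize"}
encoded = encode_row(row)
print(encoded["Coverage"], encoded["Vehicle.Size"], encoded["State"])
# prints: 1 1 Washington
```

Nominal variables such as State keep their labels because no ranking between the states is meaningful.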

1.2 Checking Missing Values

Number of Missing Values
Variable NA
Customer 0
State 0
Customer.Lifetime.Value 0
Response 0
Coverage 0
Education 0
Effective.To.Date 0
EmploymentStatus 0
Gender 0
Income 0
Location.Code 0
Marital.Status 0
Monthly.Premium.Auto 0
Months.Since.Last.Claim 0
Months.Since.Policy.Inception 0
Number.of.Open.Complaints 0
Number.of.Policies 0
Policy.Type 0
Policy 0
Renew.Offer.Type 0
Sales.Channel 0
Total.Claim.Amount 0
Vehicle.Class 0
Vehicle.Size 0

Based on the table above, there are no missing values in the vehicle insurance dataset. Therefore we do not need any imputation.
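A per-column count like the one in the table can be sketched as follows in Python (not the R code used for this report). The two sample rows, their customer IDs and the set of missing-value markers are made up for illustration; the second row has an empty Income field to show what a missing value would look like.

```python
import csv
import io

# Two made-up rows in the layout of the dataset (hypothetical values).
SAMPLE = """Customer,State,Income
AA00001,Oregon,52000
AA00002,Nevada,
"""

MISSING_MARKERS = {"", "NA"}  # assumption: how missing cells are written


def count_missing(rows):
    """Count missing cells per column in an iterable of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is None or value.strip() in MISSING_MARKERS:
                counts[column] += 1
    return counts


missing = count_missing(csv.DictReader(io.StringIO(SAMPLE)))
print(missing)
# prints: {'Customer': 0, 'State': 0, 'Income': 1}
```

For the insurance dataset every count is 0, which is why no imputation step is needed.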

1.3 Split Train and Test Dataset

We split our dataset into a train and a test dataset, with 70% of the observations in the train dataset and 30% in the test dataset. This gives 6395 observations for the train dataset and 2739 for the test dataset. The first rows of the train dataset are shown below. We have saved the train and test datasets.

Insurance Train Dataset
Customer | State | Customer.Lifetime.Value | Response | Coverage | Education | Effective.To.Date | EmploymentStatus | Gender | Income | Location.Code | Marital.Status | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Policy.Type | Policy | Renew.Offer.Type | Sales.Channel | Total.Claim.Amount | Vehicle.Class | Vehicle.Size
BU79786 | Washington | 2763.519 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | Suburban | Married | 69 | 32 | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.8111 | Two-Door Car | Medsize
QZ44356 | Arizona | 6979.536 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | Suburban | Single | 94 | 13 | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.4649 | Four-Door Car | Medsize
AI49188 | Nevada | 12887.432 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | Suburban | Married | 108 | 18 | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.4722 | Two-Door Car | Medsize
WW63253 | California | 7645.862 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | Suburban | Married | 106 | 18 | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.8813 | SUV | Medsize
HB64268 | Washington | 2813.693 | No | Basic | Bachelor | 3/2/2011 | Employed | M | 43836 | Rural | Single | 73 | 12 | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.1309 | Four-Door Car | Medsize
OC83172 | Oregon | 8256.298 | Yes | Basic | Bachelor | 1/25/11 | Employed | F | 62902 | Rural | Married | 69 | 14 | 94 | 0 | 2 | Personal Auto | Personal L3 | Offer2 | Web | 159.3830 | Two-Door Car | Medsize
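One way to obtain a 70/30 split that keeps the share of “Yes” and “No” responses is a stratified split. The Python sketch below (not the R code used for this report) shows the idea. The function name `stratified_split`, the seed and the rounding rule (round up within each class) are assumptions; the class totals of 7826 “No” and 1308 “Yes” are not stated in the report and are inferred here from the class counts and accuracies it reports.

```python
import math
import random


def stratified_split(labels, train_fraction, seed=0):
    """Split row indices into train/test while keeping the class shares.

    Each class is shuffled separately and ceil(train_fraction * class size)
    of its rows go to the training set.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = math.ceil(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Assumed class totals for the 9134 customers (inferred, not stated).
labels = ["No"] * 7826 + ["Yes"] * 1308
train_idx, test_idx = stratified_split(labels, 0.7)
train_yes = sum(labels[i] == "Yes" for i in train_idx)
print(len(train_idx), len(test_idx), train_yes)
# prints: 6395 2739 916
```

Under these assumptions the sizes agree with the 6395 and 2739 observations reported above, but the sketch is not a claim about how the split was actually produced.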

2. Initial descriptive analysis of the data

In this section we visualize the data to understand it better.

  • Our dependent variable is not balanced: in the train dataset about 5479 customers will not renew their vehicle insurance and about 916 will. We therefore consider resampling or feature engineering to improve our models.

  • Based on the boxplot above, customers with a high Total Claim Amount decided not to renew their vehicle insurance, while customers who decided to renew mostly have a Total Claim Amount between 7.345946 and 1358.4.

  • Based on the pie plot above, Agent is still the best channel for selling vehicle insurance: 38.09% of customers buy through an agent. Branch is the second channel, with 27.8% of customers coming to a branch office to buy vehicle insurance. The other two sales channels, Call Center and Web, do not differ much: 18.97% and 15.14% of customers respectively buy through them.

  • The bar plot above can inform a new strategy for the company: how to offer the renewal based on gender. Nevertheless, there is more or less no difference between female and male customers in the decision to renew. The exception is offer type 4, where male customers appear more interested in renewing than female customers, but this needs further investigation if the company would like a gender-based strategy.

  • As the plot above shows, our dependent/target variable is not balanced: about 916 customers want to renew their vehicle insurance and about 5479 do not.
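The size of the imbalance can be made concrete with a few lines of Python (a sketch, not the R code used for this report). It uses only the two class counts quoted above and also shows the accuracy that a model would reach by always predicting the majority class.

```python
# Class counts of the train dataset as quoted in this section.
counts = {"No": 5479, "Yes": 916}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the majority class.
majority_accuracy = max(shares.values())

print(total)
print({label: round(share, 4) for label, share in shares.items()})
print(round(majority_accuracy, 4))
# prints:
# 6395
# {'No': 0.8568, 'Yes': 0.1432}
# 0.8568
```

Because a trivial "always No" model already reaches about 86% accuracy, plain accuracy is a weak criterion here and balanced accuracy is more informative.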

3. Comparing Methods

In this section we analyse the categorical dependent variable with three different methods:

1.) Logistic Regression

2.) KNN

3.) Ridge Regression and Lasso

In all analyses, we use “Response” as our dependent variable and drop the two variables “Customer” and “Effective To Date”, since they are not useful for the analysis.

3.1. Logit Regression

In this first method we compare two regressions: logit regression and probit regression. The table below compares their coefficients.

Comparison Between Logit & Probit
Variable Logit Regression Probit Regression
(Intercept) -2.1238872 -1.2448839
StateCalifornia 0.1031821 0.0487939
StateNevada 0.0927564 0.0353556
StateOregon 0.1018565 0.0523793
StateWashington 0.0578533 0.0287346
Customer.Lifetime.Value -0.0000028 -0.0000016
CoverageExtended -0.0474214 -0.0230827
CoveragePremium 0.2626227 0.1391050
EducationBachelor 0.1333182 0.0677062
EducationCollege 0.3392410 0.1826816
EducationMaster 0.5230953 0.3298529
EducationDoctor 0.5803362 0.3169402
EmploymentStatusEmployed -0.2400799 -0.1302629
EmploymentStatusMedical Leave 0.1347016 0.0755668
EmploymentStatusRetired 2.3285987 1.3790203
EmploymentStatusUnemployed -0.6263815 -0.3418276
GenderM 0.0371003 0.0201790
Income 0.0000045 0.0000030
Location.CodeSuburban 1.3666499 0.7923865
Location.CodeUrban 0.1101430 0.0695286
Marital.StatusMarried -0.5858373 -0.3227722
Marital.StatusSingle -0.6010885 -0.3409333
Monthly.Premium.Auto 0.0003787 0.0001249
Months.Since.Last.Claim -0.0047451 -0.0023458
Months.Since.Policy.Inception 0.0022964 0.0011636
Number.of.Open.Complaints -0.0143820 -0.0016643
Number.of.Policies -0.0115261 -0.0071925
Policy.TypePersonal Auto 0.0137366 0.0082449
Policy.TypeSpecial Auto -0.1999415 -0.0837278
PolicyPersonal L2 -0.1288645 -0.0770827
PolicyPersonal L3 -0.1720342 -0.0930629
PolicySpecial L1 0.1673612 0.0587742
PolicySpecial L3 0.6399175 0.3355414
PolicySpecial L2 NA NA
PolicyCorporate L1 0.0768275 0.0262502
PolicyCorporate L2 0.1404268 0.0640978
PolicyCorporate L3 NA NA
Renew.Offer.TypeOffer2 0.6814189 0.3839137
Renew.Offer.TypeOffer3 -2.2073937 -1.0501764
Renew.Offer.TypeOffer4 -16.7608955 -5.7247953
Sales.ChannelBranch -0.6435594 -0.3754183
Sales.ChannelCall Center -0.5952724 -0.3547292
Sales.ChannelWeb -0.6349736 -0.3525787
Total.Claim.Amount -0.0012542 -0.0007041
Vehicle.ClassLuxury Car -0.0995168 -0.0001644
Vehicle.ClassLuxury SUV 0.7643857 0.3962873
Vehicle.ClassSports Car 0.7089762 0.3898242
Vehicle.ClassSUV 0.5048990 0.2826464
Vehicle.ClassTwo-Door Car 0.0146988 0.0089212
Vehicle.SizeMedsize 0.3736937 0.1914199
Vehicle.SizeLarge 0.7382753 0.3968874
  • At a 5% significance level, the same variables are significant in both the logit and the probit regression: Education L, EmploymentStatusRetired, EmploymentStatusUnemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch, Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car and Vehicle.Size.L.

  • Based on AIC, probit is the better model, although the difference is very small: the probit regression has an AIC of about 4201.4859018 and the logit regression an AIC of about 4207.3016481. The next table evaluates both models on their predicted values.
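The two models differ only in the function that turns the linear predictor into a probability, which is why their coefficients are on different scales yet lead to similar predictions. The Python sketch below (not the R code used for this report) applies both functions to the two intercepts from the table. Reading the intercept as the linear predictor of a customer in all reference categories with numeric variables at zero is a simplifying assumption.

```python
import math


def logistic_cdf(x):
    """Inverse link of the logit model."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x):
    """Inverse link of the probit model (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Intercepts and AIC values copied from this section.
logit_intercept = -2.1238872
probit_intercept = -1.2448839
aic_logit = 4207.3016481
aic_probit = 4201.4859018

p_logit = logistic_cdf(logit_intercept)
p_probit = normal_cdf(probit_intercept)
scale_ratio = logit_intercept / probit_intercept
aic_gap = aic_logit - aic_probit

print(f"baseline probability: logit {p_logit:.3f}, probit {p_probit:.3f}")
print(f"logit/probit intercept ratio: {scale_ratio:.2f}")
print(f"AIC gap in favour of probit: {aic_gap:.2f}")
```

Both intercepts translate into nearly the same baseline probability, so the larger logit coefficients reflect the scale of the link function rather than a stronger effect.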

3.1.1 Comparing Balanced Accuracy Basic Model

Comparing result
Metric | Logit Regression | Probit Regression
Accuracy | 0.86927 | 0.86943
Sensitivity | 0.15721 | 0.15284
Specificity | 0.98832 | 0.98923
Pos Pred Value | 0.69231 | 0.70352
Neg Pred Value | 0.87522 | 0.87476
F1 | 0.25623 | 0.25112
Balanced Accuracy | 0.57276 | 0.57104

Based on the table above, the two models do not differ much on any measure, but since the response variable is unbalanced we need to pay attention to balanced accuracy. Both models have a similar balanced accuracy of about 57%. To get the best model, we use feature engineering and resampling to increase the balanced accuracy. For feature engineering we use cross validation, and for resampling we use the down, up, rose and smote methods.
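Balanced accuracy is the mean of sensitivity and specificity, and F1 is the harmonic mean of the positive predictive value and sensitivity. The Python sketch below (not the R code used for this report) recomputes both from the rounded values in the table; the function names are chosen for illustration.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true positive rate and the true negative rate."""
    return (sensitivity + specificity) / 2.0


def f1_score(precision, recall):
    """Harmonic mean of precision (Pos Pred Value) and recall (Sensitivity)."""
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the comparison table (positive class: "Yes").
models = {
    "logit": {"sens": 0.15721, "spec": 0.98832, "ppv": 0.69231},
    "probit": {"sens": 0.15284, "spec": 0.98923, "ppv": 0.70352},
}

results = {}
for name, m in models.items():
    results[name] = {
        "balanced_accuracy": balanced_accuracy(m["sens"], m["spec"]),
        "f1": f1_score(m["ppv"], m["sens"]),
    }
    print(name,
          f"balanced accuracy {results[name]['balanced_accuracy']:.4f}",
          f"F1 {results[name]['f1']:.4f}")
```

A specificity close to 99% cannot compensate for a sensitivity of about 15%, which is why both models stay near 57% balanced accuracy despite an accuracy of about 87%.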

3.1.2 Improving probit regression

In this part, we improve our best model, the probit regression, with feature engineering (cross validation, quantiles) and resampling (up, down, smote).

Comparing result
Metric | Logit Regression | Probit Regression | insurance_train_probit2_cv | insurance_train_probit2_down | insurance_train_probit2_quantiles | insurance_train_probit2_smote | insurance_train_probit2_up
Accuracy | 0.86927 | 0.86943 | 0.86943 | 0.71040 | 0.86943 | 0.78170 | 0.70868
Sensitivity | 0.15721 | 0.15284 | 0.15284 | 0.76201 | 0.15175 | 0.50873 | 0.77838
Specificity | 0.98832 | 0.98923 | 0.98923 | 0.70177 | 0.98941 | 0.82734 | 0.69703
Pos Pred Value | 0.69231 | 0.70352 | 0.70352 | 0.29931 | 0.70558 | 0.33003 | 0.30046
Neg Pred Value | 0.87522 | 0.87476 | 0.87476 | 0.94635 | 0.87464 | 0.90969 | 0.94953
F1 | 0.25623 | 0.25112 | 0.25112 | 0.42980 | 0.24978 | 0.40034 | 0.43357
Balanced Accuracy | 0.57276 | 0.57104 | 0.57104 | 0.73189 | 0.57058 | 0.66804 | 0.73770

After improving our model with resampling and feature engineering, we merge all results into one table. The highest balanced accuracy, about 74%, comes from up-sampling; this model also has the highest sensitivity, about 78%. Its specificity, however, is the lowest in the table, about 70%. We interpret this best model (with up-sampling) below.
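Up-sampling repeats randomly chosen minority-class rows until both classes have the same size. A minimal Python sketch of the idea follows (not the R code used for this report); the function name `upsample`, the seed and the stand-in rows are assumptions, and only the class counts 5479 and 916 come from the report.

```python
import random


def upsample(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Stand-in rows (label, id) with the class counts of the train dataset.
rows = [("No", i) for i in range(5479)] + [("Yes", i) for i in range(916)]
balanced = upsample(rows, label_of=lambda row: row[0])
n_yes = sum(1 for row in balanced if row[0] == "Yes")
print(len(balanced), n_yes)
# prints: 10958 5479
```

Down-sampling works the other way round: it draws, without replacement, as many majority-class rows as there are minority-class rows. Resampling should be applied to the training data only, so that the evaluation data keep their original class shares.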

3.1.3 Best Method In Logit Regression

Best Logit & Probit Regression Results
Variable Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1218218 0.1815636 -0.6709596 0.5022463
StateCalifornia 0.0735430 0.0402105 1.8289517 0.0674068
StateNevada 0.0558561 0.0549221 1.0170057 0.3091507
StateOregon 0.0782687 0.0417387 1.8752041 0.0607647
StateWashington 0.0255310 0.0580244 0.4400050 0.6599335
Customer.Lifetime.Value -0.0000020 0.0000022 -0.8984364 0.3689530
CoverageExtended -0.0050130 0.0537187 -0.0933189 0.9256502
CoveragePremium 0.2374842 0.1113627 2.1325300 0.0329633
EducationBachelor 0.0367610 0.0375603 0.9787192 0.3277188
EducationCollege 0.1854272 0.0374743 4.9481236 0.0000007
EducationMaster 0.4361578 0.0565841 7.7081343 0.0000000
EducationDoctor 0.3322843 0.0751924 4.4191224 0.0000099
EmploymentStatusEmployed -0.0838079 0.0739979 -1.1325724 0.2573938
EmploymentStatusMedical Leave 0.1328668 0.0859182 1.5464346 0.1219996
EmploymentStatusRetired 1.3034502 0.1005039 12.9691511 0.0000000
EmploymentStatusUnemployed -0.3609602 0.0738425 -4.8882476 0.0000010
GenderM 0.0509833 0.0282586 1.8041657 0.0712053
Income 0.0000036 0.0000008 4.3380746 0.0000144
Location.CodeSuburban 0.9369885 0.0579336 16.1734845 0.0000000
Location.CodeUrban 0.1049998 0.0545251 1.9257148 0.0541400
Marital.StatusMarried -0.3665222 0.0393725 -9.3090948 0.0000000
Marital.StatusSingle -0.4122225 0.0444598 -9.2717940 0.0000000
Monthly.Premium.Auto -0.0017740 0.0021833 -0.8125352 0.4164846
Months.Since.Last.Claim -0.0014869 0.0014066 -1.0570601 0.2904842
Months.Since.Policy.Inception 0.0009104 0.0005021 1.8130751 0.0698202
Number.of.Open.Complaints 0.0258598 0.0153149 1.6885337 0.0913088
Number.of.Policies -0.0075177 0.0058589 -1.2831207 0.1994498
Policy.TypePersonal Auto -0.0175437 0.0553116 -0.3171799 0.7511071
Policy.TypeSpecial Auto -0.0969069 0.1185942 -0.8171300 0.4138541
PolicyPersonal L2 -0.0933147 0.0467180 -1.9974033 0.0457814
PolicyPersonal L3 -0.0907951 0.0432279 -2.1003816 0.0356953
PolicySpecial L1 -0.0307607 0.2072051 -0.1484555 0.8819833
PolicySpecial L3 0.3028675 0.1473215 2.0558266 0.0397992
PolicyCorporate L1 -0.0226721 0.0827581 -0.2739566 0.7841180
PolicyCorporate L2 0.0809377 0.0661026 1.2244245 0.2207921
Renew.Offer.TypeOffer2 0.4404619 0.0323020 13.6357557 0.0000000
Renew.Offer.TypeOffer3 -1.2440738 0.0597121 -20.8345313 0.0000000
Renew.Offer.TypeOffer4 -6.4781901 30.8175971 -0.2102107 0.8335032
Sales.ChannelBranch -0.4435167 0.0351675 -12.6115596 0.0000000
Sales.ChannelCall Center -0.4645988 0.0407413 -11.4036333 0.0000000
Sales.ChannelWeb -0.3508603 0.0437917 -8.0120271 0.0000000
Total.Claim.Amount -0.0007860 0.0001083 -7.2552908 0.0000000
Vehicle.ClassLuxury Car 0.0709991 0.3167809 0.2241268 0.8226587
Vehicle.ClassLuxury SUV 0.4879384 0.3074885 1.5868507 0.1125465
Vehicle.ClassSports Car 0.4595172 0.1101240 4.1727246 0.0000301
Vehicle.ClassSUV 0.4200763 0.0975002 4.3084675 0.0000164
Vehicle.ClassTwo-Door Car 0.0113326 0.0364641 0.3107863 0.7559631
Vehicle.SizeMedsize 0.1554925 0.0381214 4.0788744 0.0000453
Vehicle.SizeLarge 0.3782757 0.0535173 7.0682842 0.0000000

Based on the table above, the significant variables (p-value below 5%) include Education College, Education Master, Education Doctor, EmploymentStatusRetired, EmploymentStatusUnemployed, Location.CodeSuburban, Marital.StatusMarried, Marital.StatusSingle, Renew.Offer.TypeOffer2, Renew.Offer.TypeOffer3, Sales.ChannelBranch, Sales.ChannelCall Center, Sales.ChannelWeb, Total.Claim.Amount, Vehicle.ClassSports Car and Vehicle.SizeLarge. The sign of each estimate gives the direction of the effect:

1. Education College: customers with a College background have a higher probability of renewing their insurance than customers with High School or below

2. Education Master: customers with a Master background have a higher probability of renewing their insurance than customers with High School or below

3. Education Doctor: customers with a Doctor background have a higher probability of renewing their insurance than customers with High School or below

4. EmploymentStatusRetired: customers whose employment status is Retired have a higher probability of renewing their insurance than customers whose employment status is Disabled

5. EmploymentStatusUnemployed: the estimate is negative, so customers whose employment status is Unemployed have a lower probability of renewing their insurance than customers whose employment status is Disabled

6. Location.CodeSuburban: customers who live in a Suburban area have a higher probability of renewing their insurance than customers who live in a Rural area

7. Marital.StatusMarried: customers whose marital status is Married have a lower probability of renewing their insurance than customers whose marital status is Divorced

8. Marital.StatusSingle: customers whose marital status is Single have a lower probability of renewing their insurance than customers whose marital status is Divorced

9. Renew.Offer.TypeOffer2: customers who received renew offer type 2 have a higher probability of renewing their insurance than customers who received renew offer type 1

10. Renew.Offer.TypeOffer3: customers who received renew offer type 3 have a lower probability of renewing their insurance than customers who received renew offer type 1

11. Sales.ChannelBranch: customers who bought vehicle insurance through a Branch have a lower probability of renewing their insurance than customers who bought through an Agent

12. Sales.ChannelCall Center: customers who bought vehicle insurance through the Call Center have a lower probability of renewing their insurance than customers who bought through an Agent

13. Sales.ChannelWeb: customers who bought vehicle insurance through the Web have a lower probability of renewing their insurance than customers who bought through an Agent

14. Total.Claim.Amount: a higher Total Claim Amount decreases the probability that the customer renews their insurance

15. Vehicle.ClassSports Car: customers who have a Sports Car have a higher probability of renewing their insurance than customers who have a Four-Door Car

16. Vehicle.SizeLarge: customers who have a large vehicle have a higher probability of renewing their insurance than customers who have a small vehicle
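The direction of these effects can be checked numerically. The Python sketch below (not the R code used for this report) takes a few estimates from the table and computes how the predicted probability changes when one dummy variable switches from 0 to 1. Using the intercept alone as the baseline (all other dummies at their reference level and numeric variables at zero) is a simplifying assumption, and because the model was fitted on up-sampled data the probabilities refer to a balanced sample.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, the inverse link of the probit model."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Estimates copied from the coefficient table of the up-sampled probit model.
INTERCEPT = -0.1218218
COEFFICIENTS = {
    "EmploymentStatusRetired": 1.3034502,
    "EmploymentStatusUnemployed": -0.3609602,
    "Marital.StatusMarried": -0.3665222,
    "Renew.Offer.TypeOffer2": 0.4404619,
}


def probability_shift(coefficient, baseline_index=INTERCEPT):
    """Change in P(Response = Yes) when one dummy switches from 0 to 1."""
    return normal_cdf(baseline_index + coefficient) - normal_cdf(baseline_index)


shifts = {name: probability_shift(beta) for name, beta in COEFFICIENTS.items()}
for name, shift in shifts.items():
    print(f"{name}: {shift:+.3f}")
```

A positive shift means a higher renewal probability than in the reference category and a negative shift a lower one; the size of the shift depends on the chosen baseline, so only the sign carries over to other customers.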

3.2. KNN

In the KNN part we build 5 models. The first model is the basic KNN, the second uses a tune grid with k=80, the third is tuned over different values of k, and the fourth and fifth use scaled variables.
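The roles of k and of scaling can be seen in a minimal Python sketch of KNN (not the R code used for this report). The six data points are made up, and min-max scaling is only one possible choice of scaling, assumed here for illustration.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no feature dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (income, monthly premium); the values are made up.
x = [[20000, 60], [25000, 65], [22000, 70], [30000, 62],
     [90000, 110], [95000, 120]]
y = ["No", "No", "No", "No", "Yes", "Yes"]
scaled = min_max_scale(x)
query = scaled[4]  # one of the "Yes" customers

print(knn_predict(scaled, y, query, k=1))
print(knn_predict(scaled, y, query, k=6))
# prints:
# Yes
# No
```

With k=1 the vote is decided by the closest point alone. With a k as large as the whole toy dataset the majority class always wins, which is the same mechanism that makes a very large k predict “No” for every customer in an unbalanced dataset.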

3.2.1 Summary Result KNN

KNN Results based on test dataset
Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy
insurace_train_knn_5 | 0.84155 | 0.81378 | 0.84619 | 0.46912 | 0.96455 | 0.59515 | 0.82998
insurance_train_knn_80 | 0.85688 | 0.00000 | 1.00000 | NaN | 0.85688 | NA | 0.50000
insurance_train_knn_tunned | 0.97189 | 1.00000 | 0.96719 | 0.83582 | 1.00000 | 0.91057 | 0.98360
insurance_train_knn_scaled | 0.90471 | 0.64796 | 0.94759 | 0.67374 | 0.94157 | 0.66060 | 0.79778
insurance_train_knn_scaled2 | 0.90507 | 0.64286 | 0.94887 | 0.67742 | 0.94085 | 0.65969 | 0.79586

Based on predictions on the test dataset, the tuned KNN model has the highest balanced accuracy, about 98%, and also the highest sensitivity. The response is unbalanced: the train dataset has about 5479 “No” observations and only 916 “Yes” observations. The model with k=80 has the highest specificity, 100%, but only because it predicts “No” for every observation: its sensitivity is 0 and its balanced accuracy is 50%, so its ROC area is not good. Since the response is unbalanced, resampling can help here.

We also predicted on the train dataset, and the conclusion is the same: insurance_train_knn_tunned has the highest balanced accuracy. On the train dataset, however, its balanced accuracy is 100%, which indicates overfitting. We would like to know what happens if we continue to improve our model with feature engineering and resampling.

KNN Results based on train dataset
Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy
insurace_train_knn_5 | 0.90008 | 0.96834 | 0.88867 | 0.59252 | 0.99408 | 0.73518 | 0.92850
insurance_train_knn_80 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
insurance_train_knn_tunned | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000
insurance_train_knn_scaled | 0.94464 | 0.81987 | 0.96550 | 0.79894 | 0.96975 | 0.80927 | 0.89269
insurance_train_knn_scaled2 | 0.94402 | 0.81878 | 0.96496 | 0.79618 | 0.96956 | 0.80732 | 0.89187

Before that, we show our best KNN model (based on the test dataset). This model also has a high ROC area, about 97%, which means that its predictions on the train dataset are very good.

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  5306  169
       Yes  173  747

               Accuracy : 0.9465
                 95% CI : (0.9407, 0.9519)
    No Information Rate : 0.8568
    P-Value [Acc > NIR] : <0.0000000000000002

                  Kappa : 0.7825

 Mcnemar’s Test P-Value : 0.8711

            Sensitivity : 0.8155
            Specificity : 0.9684
         Pos Pred Value : 0.8120
         Neg Pred Value : 0.9691
             Prevalence : 0.1432
         Detection Rate : 0.1168
   Detection Prevalence : 0.1439
      Balanced Accuracy : 0.8920

       'Positive' Class : Yes

[1] 0.9732075
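The statistics in this output follow directly from the four cells of the confusion matrix. The Python sketch below (not the R code used for this report) recomputes them from the counts shown above, with “Yes” as the positive class; the variable names are chosen for illustration.

```python
# Counts from the confusion matrix (rows: prediction, columns: reference).
tn, fn = 5306, 169   # predicted "No":  reference "No", reference "Yes"
fp, tp = 173, 747    # predicted "Yes": reference "No", reference "Yes"

n = tn + fn + fp + tp
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)       # share of actual "Yes" that is found
specificity = tn / (tn + fp)       # share of actual "No" that is found
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance.
expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [("Accuracy", accuracy), ("Sensitivity", sensitivity),
                    ("Specificity", specificity),
                    ("Pos Pred Value", pos_pred_value),
                    ("Neg Pred Value", neg_pred_value),
                    ("Balanced Accuracy", balanced_accuracy),
                    ("Kappa", kappa)]:
    print(f"{name}: {value:.4f}")
```

The four cells add up to 6395 observations, and the recomputed values agree with the printed statistics to four decimals.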

3.2.2 Improving KNN

In this part, we improve our KNN models with feature engineering (cross validation, quantiles) and resampling (up, down, smote).

3.2.3 The Best Model After Improving : KNN

Best KNN Methods Results
Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy
insurace_train_knn_5 | 0.90008 | 0.96834 | 0.88867 | 0.59252 | 0.99408 | 0.73518 | 0.92850
insurance_train_knn_80 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
insurance_train_knn_tunned | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000
insurance_train_knn_scaled | 0.94464 | 0.81987 | 0.96550 | 0.79894 | 0.96975 | 0.80927 | 0.89269
insurance_train_knn_scaled2 | 0.94402 | 0.81878 | 0.96496 | 0.79618 | 0.96956 | 0.80732 | 0.89187
insurance_train_knn2_down | 0.85958 | 1.00000 | 0.83610 | 0.50496 | 1.00000 | 0.67106 | 0.91805
insurance_train_knn2_tunned_cv | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000
insurance_train_knn2_tunned_quantiles | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000
insurance_train_knn2_up | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000

Our KNN models have high balanced accuracy, but their sensitivity is also suspiciously high. A balanced accuracy of about 100%, as for insurance_train_knn2_tunned_cv, insurance_train_knn2_tunned_quantiles and insurance_train_knn2_up, suggests that something is wrong with these models, most likely overfitting. Our best KNN model here uses k=1, with an accuracy of 95% and a ROC area of about 96%, which indicates a very good model. Its sensitivity and specificity are about 95% and 98% respectively.
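A perfect score on the train dataset is expected for KNN with k=1, whatever the data look like, because every training point is its own nearest neighbour. The Python sketch below (not the R code used for this report) demonstrates this on random features with random labels, where there is nothing to learn; all data in it are synthetic.

```python
import math
import random


def one_nn_predict(train_x, train_y, query):
    """Return the label of the single training point closest to query."""
    nearest = min(range(len(train_x)),
                  key=lambda i: math.dist(train_x[i], query))
    return train_y[nearest]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Share of evaluation points whose 1-NN prediction is correct."""
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


rng = random.Random(0)


def random_sample(n):
    """n points with random features and random labels: nothing to learn."""
    xs = [[rng.random(), rng.random()] for _ in range(n)]
    ys = [rng.choice(["No", "Yes"]) for _ in range(n)]
    return xs, ys


train_x, train_y = random_sample(200)
test_x, test_y = random_sample(200)

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

The train accuracy is 1.00 even though the labels are pure noise, while the accuracy on new points stays near chance. For this reason a k=1 model should be judged on data that were not used to fit it.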

3.3 Ridge Regression and Lasso

In this section we analyse several variants of ridge and lasso regression with two different lambda grids: the basic ridge and lasso regressions use a grid of 200 lambdas, and the second setting uses 10000 lambdas. We also include elastic net to get a different result.
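Ridge, lasso and elastic net differ only in the penalty added to the loss function. The Python sketch below (not the R code used for this report) writes the penalty in the convention used by glmnet, where alpha = 0 gives ridge and alpha = 1 gives lasso, and builds a log-spaced grid of 200 lambdas. The coefficients and the grid bounds are made up for illustration.

```python
import math


def elastic_net_penalty(coefficients, lam, alpha):
    """Penalty lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|)).

    alpha = 0 gives the ridge penalty, alpha = 1 the lasso penalty,
    and values in between give elastic net.
    """
    l1 = sum(abs(b) for b in coefficients)
    l2 = sum(b * b for b in coefficients)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


def log_spaced_lambdas(low, high, count):
    """count values from high down to low, evenly spaced on the log scale."""
    step = (math.log(low) - math.log(high)) / (count - 1)
    return [math.exp(math.log(high) + i * step) for i in range(count)]


beta = [0.5, -1.0, 0.0]                        # made-up coefficients
lambdas = log_spaced_lambdas(1e-4, 1.0, 200)   # grid bounds are assumed

ridge = elastic_net_penalty(beta, lam=0.1, alpha=0.0)
lasso = elastic_net_penalty(beta, lam=0.1, alpha=1.0)
mixed = elastic_net_penalty(beta, lam=0.1, alpha=0.5)
print(f"ridge {ridge:.5f}, lasso {lasso:.5f}, elastic net {mixed:.5f}")
print(len(lambdas), f"{lambdas[0]:.4f}", f"{lambdas[-1]:.4f}")
```

A larger lambda shrinks the coefficients more strongly. A very large lambda shrinks them towards zero, in which case the model predicts the majority class for every observation.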

3.3.1 Comparing Best Basic Model : Ridge Regression and Lasso

Comparing result
Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy
result_lambda_ridge | 0.87224 | 0.14301 | 0.99416 | 0.80368 | 0.87404 | 0.24282 | 0.56859
result_lambda_ridge2 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
result_lambda_logit | 0.87162 | 0.14629 | 0.99288 | 0.77457 | 0.87432 | 0.24610 | 0.56959
result_lambda_lasso | 0.87162 | 0.14192 | 0.99361 | 0.78788 | 0.87384 | 0.24052 | 0.56777
result_lambda_elastic | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
result_lambda_elastic2 | 0.87177 | 0.14192 | 0.99379 | 0.79268 | 0.87386 | 0.24074 | 0.56786

Based on these results, no model has a good balanced accuracy: the values are all similar, around 50%-57%. Among the ridge, lasso and elastic net models, the highest balanced accuracy is that of the basic ridge regression with parameter lambda and cross validation. We therefore need to improve the models with resampling and feature engineering to get a higher balanced accuracy.

3.3.2 Improving Models : Ridge Regression and Lasso

In this part, we improve our ridge and lasso regressions with feature engineering (cross validation, quantiles) and resampling (up, down, smote).

3.3.3 Comparing Best Method After Improving : Ridge Regression and Lasso

Best Ridge and Lasso Regression Methods Results
Model | Accuracy | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | F1 | Balanced Accuracy
result_lambda_ridge | 0.87224 | 0.14301 | 0.99416 | 0.80368 | 0.87404 | 0.24282 | 0.56859
result_lambda_ridge2 | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
result_lambda_logit | 0.87162 | 0.14629 | 0.99288 | 0.77457 | 0.87432 | 0.24610 | 0.56959
result_lambda_lasso | 0.87162 | 0.14192 | 0.99361 | 0.78788 | 0.87384 | 0.24052 | 0.56777
result_lambda_elastic | 0.85676 | 0.00000 | 1.00000 | NaN | 0.85676 | NA | 0.50000
result_lambda_elastic2 | 0.87177 | 0.14192 | 0.99379 | 0.79268 | 0.87386 | 0.24074 | 0.56786
insurance_train_lambda2_cv | 0.87193 | 0.14192 | 0.99398 | 0.79755 | 0.87388 | 0.24096 | 0.56795
insurance_train_lambda2_down | 0.70242 | 0.76638 | 0.69173 | 0.29360 | 0.94655 | 0.42455 | 0.72905
insurance_train_lambda2_quantiles | 0.87193 | 0.14192 | 0.99398 | 0.79755 | 0.87388 | 0.24096 | 0.56795
insurance_train_lambda2_smote | 0.85442 | 0.06114 | 0.98704 | 0.44094 | 0.86280 | 0.10738 | 0.52409
insurance_train_lambda2_up | 0.70446 | 0.79258 | 0.68972 | 0.29926 | 0.95213 | 0.43447 | 0.74115

Based on the table above, the model with up-sampling has the highest balanced accuracy, about 74%, and the highest sensitivity, about 79%. Among the resampled models, the highest specificity is that of the smote model. We analyse our best model below.

3.3.4 The Best Model : Ridge Regression and Lasso

Our best model among the ridge and lasso regressions is the lasso regression (with up-sampling) with lambda = 0.0128989. Its accuracy is 70% and its balanced accuracy is 74%. The ROC area is quite high, about 80%, which means that the model is good. Its sensitivity is about 69%, its specificity about 73% and its Kappa 26.5%.

4. Summary and Conclusions

Across all methods, resampling increases the balanced accuracy, since our dependent variable is not balanced:

    1. In Logit Regression, the best model is insurance_train_probit2_up, with a balanced accuracy of 74%.
    2. In KNN, the model insurance_train_knn2_tunned_cv has a better balanced accuracy than the logit regression, about 95%, and its ROC area is also very high, about 96%, which means that the model is very good.
    3. In Ridge and Lasso Regression, the balanced accuracy is lower than for KNN, about 74%.

If we compare all models, we can say that KNN is the best model for this dataset, since it has the highest balanced accuracy.