INTRODUCTION

A car insurance claim classifier is a business analytics tool designed to analyze and categorize car insurance claims based on various parameters. It utilizes advanced machine learning techniques to automatically process and understand the contents of claim documents, such as accident reports, repair estimates, and customer statements. The primary use case for a car insurance claim classifier is to streamline and optimize the claims management process.

DATA UNDERSTANDING

library(dplyr)   # %>%, select(), mutate_at()
library(tidyr)   # drop_na()

dataset <- read.csv("data_input/Car_Insurance_Claim.csv", stringsAsFactors = TRUE)
head(dataset)
str(dataset)
## 'data.frame':    10000 obs. of  19 variables:
##  $ ID                 : int  569520 750365 199901 478866 731664 877557 930134 461006 68366 445911 ...
##  $ AGE                : Factor w/ 4 levels "16-25","26-39",..: 4 1 1 1 2 3 4 2 3 3 ...
##  $ GENDER             : Factor w/ 2 levels "female","male": 1 2 1 2 2 1 2 1 1 1 ...
##  $ RACE               : Factor w/ 2 levels "majority","minority": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DRIVING_EXPERIENCE : Factor w/ 4 levels "0-9y","10-19y",..: 1 1 1 1 2 3 4 1 3 1 ...
##  $ EDUCATION          : Factor w/ 3 levels "high school",..: 1 2 1 3 2 1 1 3 3 1 ...
##  $ INCOME             : Factor w/ 4 levels "middle class",..: 3 2 4 4 4 3 3 4 4 3 ...
##  $ CREDIT_SCORE       : num  0.629 0.358 0.493 0.206 0.388 ...
##  $ VEHICLE_OWNERSHIP  : num  1 0 1 1 1 1 0 0 0 1 ...
##  $ VEHICLE_YEAR       : Factor w/ 2 levels "after 2015","before 2015": 1 2 2 2 2 1 1 1 2 2 ...
##  $ MARRIED            : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ CHILDREN           : num  1 0 0 1 0 1 1 1 0 1 ...
##  $ POSTAL_CODE        : int  10238 10238 10238 32765 32765 10238 10238 10238 10238 32765 ...
##  $ ANNUAL_MILEAGE     : num  12000 16000 11000 11000 12000 13000 13000 14000 13000 11000 ...
##  $ VEHICLE_TYPE       : Factor w/ 2 levels "sedan","sports car": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPEEDING_VIOLATIONS: int  0 0 0 0 2 3 7 0 0 0 ...
##  $ DUIS               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PAST_ACCIDENTS     : int  0 0 0 0 1 3 3 0 0 0 ...
##  $ OUTCOME            : num  0 1 0 0 1 0 0 1 0 1 ...

Numeric predictors: CREDIT_SCORE, VEHICLE_OWNERSHIP, MARRIED, CHILDREN, POSTAL_CODE, ANNUAL_MILEAGE, SPEEDING_VIOLATIONS, DUIS, PAST_ACCIDENTS

Categorical predictors: AGE, GENDER, RACE, DRIVING_EXPERIENCE, EDUCATION, INCOME, VEHICLE_YEAR, VEHICLE_TYPE

Columns with an inappropriate data type:
  • OUTCOME should be converted to a factor because its values are 0 and 1 (representing no or yes)
  • MARRIED should be converted to a factor because its values are 0 and 1 (representing no or yes)
  • POSTAL_CODE should be converted to a factor because its values repeat and it has only 4 unique values

When building the classification models, we can drop the ID column, since an identifier has no predictive relation to the outcome.

car_claim <- dataset %>% 
  select(-ID) %>% 
  mutate_at(vars(OUTCOME,MARRIED,POSTAL_CODE), as.factor)

CHECKING NA AND DUPLICATE VALUES

# CHECKING THE N/A VALUES
colSums(is.na(car_claim))
##                 AGE              GENDER                RACE  DRIVING_EXPERIENCE 
##                   0                   0                   0                   0 
##           EDUCATION              INCOME        CREDIT_SCORE   VEHICLE_OWNERSHIP 
##                   0                   0                 982                   0 
##        VEHICLE_YEAR             MARRIED            CHILDREN         POSTAL_CODE 
##                   0                   0                   0                   0 
##      ANNUAL_MILEAGE        VEHICLE_TYPE SPEEDING_VIOLATIONS                DUIS 
##                 957                   0                   0                   0 
##      PAST_ACCIDENTS             OUTCOME 
##                   0                   0
# DROPPING ALL ROWS THAT HAVE N/A VALUES
car_claim_clean <- drop_na(car_claim)

# CHECKING THE DUPLICATE VALUES
sum(duplicated(car_claim_clean))
## [1] 0

💡 Insight:
As the output above shows, this data frame has:
1. 982 missing values in column CREDIT_SCORE out of 10,000 rows
2. 957 missing values in column ANNUAL_MILEAGE out of 10,000 rows
3. For this analysis, we drop all rows that contain N/A
4. After dropping those rows, we check for duplicate rows and find none
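
Dropping rows removes every record that is missing either value, so the number of rows lost need not equal either column's count. The sketch below illustrates this on a made-up data frame (toy_claims and its values are hypothetical, not taken from the dataset). It uses base R's complete.cases(), which keeps the same rows as drop_na() when no columns are specified.

```r
# Toy data frame: values are made up for illustration only
toy_claims <- data.frame(
  CREDIT_SCORE   = c(0.63, NA, 0.49, 0.21, NA),
  ANNUAL_MILEAGE = c(12000, 16000, NA, 11000, NA),
  OUTCOME        = c(0, 1, 0, 0, 1)
)

# Missing values per column (same check as colSums(is.na(...)) above)
na_per_column <- colSums(is.na(toy_claims))

# Keep only the rows without any missing value
toy_clean <- toy_claims[complete.cases(toy_claims), ]

# Number of rows removed by the filter
rows_dropped <- nrow(toy_claims) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

In this toy example each of the two columns has two missing values, yet only three rows are dropped, because one row is missing both.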

DEFINING PREDICTORS AND TARGET

Our goal is to create a model that can identify customers who are likely to make a claim, based on several parameters.
So our target is the OUTCOME column, and the remaining columns are predictors.
We split the dataset into data_train and data_test in an 80:20 proportion.


RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(203)

# index sampling 
# ---- take 80 percent of the row indices -----
index <- sample(x = nrow(car_claim_clean), size = 0.8 * nrow(car_claim_clean))

# splitting data
data_train <- car_claim_clean[index, ]   # take 80% of the data
data_test  <- car_claim_clean[-index, ]  # take the remaining 20%
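
A quick sanity check on an index-based split is that the two sets are disjoint and together cover every row. The following is a minimal sketch under stated assumptions: n_rows is a hypothetical row count, and floor() is added here to make the training size an explicit integer. It does not reproduce the RNGkind() setting used above, so the sampled indices will differ from the report's.

```r
set.seed(203)

n_rows     <- 1000                  # hypothetical number of rows
train_size <- floor(0.8 * n_rows)   # explicit integer size for the training set

# Draw row indices without replacement
index <- sample(x = n_rows, size = train_size)

train_rows <- index
test_rows  <- setdiff(seq_len(n_rows), index)

# The two sets should not overlap and should cover all rows
overlap <- length(intersect(train_rows, test_rows))

print(length(train_rows))
print(length(test_rows))
print(overlap)
```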

EXPLORATORY DATA ANALYSIS

CHECKING CLASS IMBALANCE

For this base model, we choose not to handle the class imbalance, so that we can later compare it with a model in which the imbalance is handled.

table(data_train$OUTCOME) %>% 
  prop.table()
## 
##       0       1 
## 0.68262 0.31738
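
These proportions also give a reference point for later evaluation: a classifier that always predicts the majority class would already be correct for about 68% of the training rows. A small sketch of that arithmetic, using only the two proportions printed above (the variable names are illustrative):

```r
# Class proportions reported for data_train above
class_share <- c("0" = 0.68262, "1" = 0.31738)

# Accuracy of a classifier that always predicts the majority class
majority_class    <- names(which.max(class_share))
baseline_accuracy <- max(class_share)

# Ratio of non-claims to claims
imbalance_ratio <- class_share[["0"]] / class_share[["1"]]

print(majority_class)
print(baseline_accuracy)
print(round(imbalance_ratio, 2))
```

Any model whose accuracy is close to this baseline adds little, which is one reason accuracy alone can mislead on imbalanced data.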

MODELING

BASE MODEL

For the base model, we fit a model that includes all the predictors.

model_claim <- glm(formula = OUTCOME ~ . ,
                   data = data_train,
                   family = "binomial")

summary(model_claim)
## 
## Call:
## glm(formula = OUTCOME ~ ., family = "binomial", data = data_train)
## 
## Coefficients:
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               -2.13948519   0.39422826  -5.427    0.000000057302257
## AGE26-39                  -0.03632486   0.13305306  -0.273             0.784845
## AGE40-64                  -0.17728324   0.15748537  -1.126             0.260287
## AGE65+                    -0.09566942   0.19535964  -0.490             0.624340
## GENDERmale                 1.01180583   0.08562021  11.817 < 0.0000000000000002
## RACEminority              -0.06882417   0.12410914  -0.555             0.579206
## DRIVING_EXPERIENCE10-19y  -2.13157396   0.13302536 -16.024 < 0.0000000000000002
## DRIVING_EXPERIENCE20-29y  -4.13218544   0.24116869 -17.134 < 0.0000000000000002
## DRIVING_EXPERIENCE30y+    -5.45895788   0.51745200 -10.550 < 0.0000000000000002
## EDUCATIONnone              0.07812510   0.10885028   0.718             0.472924
## EDUCATIONuniversity       -0.00028646   0.09919979  -0.003             0.997696
## INCOMEpoverty             -0.08123366   0.15664331  -0.519             0.604047
## INCOMEupper class          0.02611490   0.13276479   0.197             0.844062
## INCOMEworking class       -0.03926957   0.12748790  -0.308             0.758063
## CREDIT_SCORE               0.35892098   0.43391080   0.827             0.408137
## VEHICLE_OWNERSHIP         -2.07017946   0.09178648 -22.554 < 0.0000000000000002
## VEHICLE_YEARbefore 2015    2.08473345   0.11250188  18.531 < 0.0000000000000002
## MARRIED1                  -0.31365822   0.09386823  -3.341             0.000833
## CHILDREN                  -0.09340746   0.09403745  -0.993             0.320563
## POSTAL_CODE21217          21.37087977 177.46346479   0.120             0.904147
## POSTAL_CODE32765           1.30336770   0.10774385  12.097 < 0.0000000000000002
## POSTAL_CODE92101           1.38611233   0.18829123   7.362    0.000000000000182
## ANNUAL_MILEAGE             0.00013849   0.00001891   7.323    0.000000000000242
## VEHICLE_TYPEsports car    -0.02118414   0.18218206  -0.116             0.907431
## SPEEDING_VIOLATIONS        0.04225682   0.03458996   1.222             0.221840
## DUIS                       0.14809080   0.10128250   1.462             0.143699
## PAST_ACCIDENTS            -0.06591079   0.04921493  -1.339             0.180491
##                             
## (Intercept)              ***
## AGE26-39                    
## AGE40-64                    
## AGE65+                      
## GENDERmale               ***
## RACEminority                
## DRIVING_EXPERIENCE10-19y ***
## DRIVING_EXPERIENCE20-29y ***
## DRIVING_EXPERIENCE30y+   ***
## EDUCATIONnone               
## EDUCATIONuniversity         
## INCOMEpoverty               
## INCOMEupper class           
## INCOMEworking class         
## CREDIT_SCORE                
## VEHICLE_OWNERSHIP        ***
## VEHICLE_YEARbefore 2015  ***
## MARRIED1                 ***
## CHILDREN                    
## POSTAL_CODE21217            
## POSTAL_CODE32765         ***
## POSTAL_CODE92101         ***
## ANNUAL_MILEAGE           ***
## VEHICLE_TYPEsports car      
## SPEEDING_VIOLATIONS         
## DUIS                        
## PAST_ACCIDENTS              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8147.2  on 6518  degrees of freedom
## Residual deviance: 4148.2  on 6492  degrees of freedom
## AIC: 4202.2
## 
## Number of Fisher Scoring iterations: 15

💡 Insight:

  • GENDERmale, DRIVING_EXPERIENCE10-19y, DRIVING_EXPERIENCE20-29y, DRIVING_EXPERIENCE30y+, VEHICLE_OWNERSHIP, VEHICLE_YEARbefore 2015, MARRIED1, POSTAL_CODE32765, POSTAL_CODE92101 and ANNUAL_MILEAGE are the significant predictors (p < 0.001)
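
The z value and p-value columns in the summary follow directly from each estimate and its standard error. The sketch below reproduces that Wald calculation for two rows copied from the table above; the variable names are illustrative.

```r
# Estimate and standard error for GENDERmale, copied from the summary above
estimate  <- 1.01180583
std_error <- 0.08562021

# Wald statistic and two-sided p-value under a standard normal reference
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Same calculation for POSTAL_CODE21217, whose standard error is very large
z_postal <- 21.37087977 / 177.46346479
p_postal <- 2 * pnorm(-abs(z_postal))

print(round(z_value, 3))
print(round(z_postal, 3))
print(round(p_postal, 3))
```

The POSTAL_CODE21217 row deserves a second look. An estimate near 21 with a standard error near 177, together with 15 Fisher scoring iterations, is the pattern often seen when almost all training rows in one category share the same outcome. This is an interpretation rather than something the output states; a cross-tabulation of POSTAL_CODE against OUTCOME in data_train would confirm or rule it out.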

NULL MODEL

model_claim_null <- glm(formula = OUTCOME ~ 1 ,
                   data = data_train,
                   family = "binomial")

summary(model_claim_null)
## 
## Call:
## glm(formula = OUTCOME ~ 1, family = "binomial", data = data_train)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -0.76584    0.02661  -28.78 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8147.2  on 6518  degrees of freedom
## Residual deviance: 8147.2  on 6518  degrees of freedom
## AIC: 8149.2
## 
## Number of Fisher Scoring iterations: 4
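
The null model has a simple closed form, which makes it a useful check on the pipeline. Its intercept is the log-odds of the claim rate in data_train, and its deviance depends only on that rate and the number of rows. The sketch below recomputes both from numbers printed earlier; small differences from the summary are expected because the claim rate is rounded to five decimals.

```r
# Share of OUTCOME = 1 in data_train, from the class-balance table
p_claim <- 0.31738
n_train <- 6519   # 6518 null degrees of freedom + 1

# The intercept of an intercept-only logistic model is the log-odds of the base rate
intercept <- log(p_claim / (1 - p_claim))   # equivalent to qlogis(p_claim)

# Null deviance: -2 times the log-likelihood of the intercept-only model
null_deviance <- -2 * n_train * (p_claim * log(p_claim) +
                                 (1 - p_claim) * log(1 - p_claim))

# Converting the intercept back recovers the claim rate
p_back <- plogis(intercept)

print(round(intercept, 3))
print(round(null_deviance, 1))
print(p_back)
```

Both values agree with the summary above (intercept -0.76584, null deviance 8147.2) up to rounding of the inputs.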

STEP-WISE MODEL

Model Backward

model_backward <- step(object = model_claim,
                       scope = list(lower = model_claim_null),
                       direction = "backward")
## Start:  AIC=4202.23
## OUTCOME ~ AGE + GENDER + RACE + DRIVING_EXPERIENCE + EDUCATION + 
##     INCOME + CREDIT_SCORE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + 
##     MARRIED + CHILDREN + POSTAL_CODE + ANNUAL_MILEAGE + VEHICLE_TYPE + 
##     SPEEDING_VIOLATIONS + DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - INCOME               3   4148.5 4196.5
## - AGE                  3   4149.8 4197.8
## - EDUCATION            2   4148.8 4198.8
## - VEHICLE_TYPE         1   4148.2 4200.2
## - RACE                 1   4148.5 4200.5
## - CREDIT_SCORE         1   4148.9 4200.9
## - CHILDREN             1   4149.2 4201.2
## - SPEEDING_VIOLATIONS  1   4149.7 4201.7
## - PAST_ACCIDENTS       1   4150.0 4202.0
## <none>                     4148.2 4202.2
## - DUIS                 1   4150.3 4202.3
## - MARRIED              1   4159.4 4211.4
## - ANNUAL_MILEAGE       1   4203.0 4255.0
## - GENDER               1   4294.4 4346.4
## - VEHICLE_YEAR         1   4568.2 4620.2
## - DRIVING_EXPERIENCE   3   4618.3 4666.3
## - VEHICLE_OWNERSHIP    1   4731.9 4783.9
## - POSTAL_CODE          3   4791.9 4839.9
## 
## Step:  AIC=4196.54
## OUTCOME ~ AGE + GENDER + RACE + DRIVING_EXPERIENCE + EDUCATION + 
##     CREDIT_SCORE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + MARRIED + 
##     CHILDREN + POSTAL_CODE + ANNUAL_MILEAGE + VEHICLE_TYPE + 
##     SPEEDING_VIOLATIONS + DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - AGE                  3   4149.9 4191.9
## - EDUCATION            2   4148.9 4192.9
## - VEHICLE_TYPE         1   4148.6 4194.6
## - RACE                 1   4148.8 4194.8
## - CHILDREN             1   4149.5 4195.5
## - SPEEDING_VIOLATIONS  1   4150.0 4196.0
## - PAST_ACCIDENTS       1   4150.3 4196.3
## - CREDIT_SCORE         1   4150.5 4196.5
## <none>                     4148.5 4196.5
## - DUIS                 1   4150.7 4196.7
## - MARRIED              1   4159.4 4205.4
## - ANNUAL_MILEAGE       1   4203.2 4249.2
## - GENDER               1   4296.8 4342.8
## - VEHICLE_YEAR         1   4580.1 4626.1
## - DRIVING_EXPERIENCE   3   4620.7 4662.7
## - VEHICLE_OWNERSHIP    1   4757.4 4803.4
## - POSTAL_CODE          3   4792.6 4834.6
## 
## Step:  AIC=4191.88
## OUTCOME ~ GENDER + RACE + DRIVING_EXPERIENCE + EDUCATION + CREDIT_SCORE + 
##     VEHICLE_OWNERSHIP + VEHICLE_YEAR + MARRIED + CHILDREN + POSTAL_CODE + 
##     ANNUAL_MILEAGE + VEHICLE_TYPE + SPEEDING_VIOLATIONS + DUIS + 
##     PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - EDUCATION            2   4150.2 4188.2
## - VEHICLE_TYPE         1   4149.9 4189.9
## - RACE                 1   4150.2 4190.2
## - CHILDREN             1   4151.3 4191.3
## - CREDIT_SCORE         1   4151.3 4191.3
## - SPEEDING_VIOLATIONS  1   4151.5 4191.5
## - PAST_ACCIDENTS       1   4151.6 4191.6
## <none>                     4149.9 4191.9
## - DUIS                 1   4152.0 4192.0
## - MARRIED              1   4161.8 4201.8
## - ANNUAL_MILEAGE       1   4205.1 4245.1
## - GENDER               1   4297.7 4337.7
## - VEHICLE_YEAR         1   4588.2 4628.2
## - VEHICLE_OWNERSHIP    1   4762.8 4802.8
## - DRIVING_EXPERIENCE   3   4769.8 4805.8
## - POSTAL_CODE          3   4795.4 4831.4
## 
## Step:  AIC=4188.18
## OUTCOME ~ GENDER + RACE + DRIVING_EXPERIENCE + CREDIT_SCORE + 
##     VEHICLE_OWNERSHIP + VEHICLE_YEAR + MARRIED + CHILDREN + POSTAL_CODE + 
##     ANNUAL_MILEAGE + VEHICLE_TYPE + SPEEDING_VIOLATIONS + DUIS + 
##     PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - VEHICLE_TYPE         1   4150.2 4186.2
## - RACE                 1   4150.5 4186.5
## - CREDIT_SCORE         1   4151.5 4187.5
## - CHILDREN             1   4151.5 4187.5
## - SPEEDING_VIOLATIONS  1   4151.8 4187.8
## - PAST_ACCIDENTS       1   4151.9 4187.9
## <none>                     4150.2 4188.2
## - DUIS                 1   4152.4 4188.4
## - MARRIED              1   4162.2 4198.2
## - ANNUAL_MILEAGE       1   4205.4 4241.4
## - GENDER               1   4300.4 4336.4
## - VEHICLE_YEAR         1   4593.5 4629.5
## - DRIVING_EXPERIENCE   3   4770.1 4802.1
## - VEHICLE_OWNERSHIP    1   4772.1 4808.1
## - POSTAL_CODE          3   4795.6 4827.6
## 
## Step:  AIC=4186.21
## OUTCOME ~ GENDER + RACE + DRIVING_EXPERIENCE + CREDIT_SCORE + 
##     VEHICLE_OWNERSHIP + VEHICLE_YEAR + MARRIED + CHILDREN + POSTAL_CODE + 
##     ANNUAL_MILEAGE + SPEEDING_VIOLATIONS + DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - RACE                 1   4150.5 4184.5
## - CREDIT_SCORE         1   4151.5 4185.5
## - CHILDREN             1   4151.6 4185.6
## - SPEEDING_VIOLATIONS  1   4151.8 4185.8
## - PAST_ACCIDENTS       1   4151.9 4185.9
## <none>                     4150.2 4186.2
## - DUIS                 1   4152.4 4186.4
## - MARRIED              1   4162.3 4196.3
## - ANNUAL_MILEAGE       1   4205.4 4239.4
## - GENDER               1   4300.4 4334.4
## - VEHICLE_YEAR         1   4593.7 4627.7
## - DRIVING_EXPERIENCE   3   4770.3 4800.3
## - VEHICLE_OWNERSHIP    1   4772.4 4806.4
## - POSTAL_CODE          3   4795.7 4825.7
## 
## Step:  AIC=4184.51
## OUTCOME ~ GENDER + DRIVING_EXPERIENCE + CREDIT_SCORE + VEHICLE_OWNERSHIP + 
##     VEHICLE_YEAR + MARRIED + CHILDREN + POSTAL_CODE + ANNUAL_MILEAGE + 
##     SPEEDING_VIOLATIONS + DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - CREDIT_SCORE         1   4151.9 4183.9
## - CHILDREN             1   4151.9 4183.9
## - SPEEDING_VIOLATIONS  1   4152.1 4184.1
## - PAST_ACCIDENTS       1   4152.3 4184.3
## <none>                     4150.5 4184.5
## - DUIS                 1   4152.7 4184.7
## - MARRIED              1   4162.5 4194.5
## - ANNUAL_MILEAGE       1   4205.7 4237.7
## - GENDER               1   4301.4 4333.4
## - VEHICLE_YEAR         1   4593.8 4625.8
## - DRIVING_EXPERIENCE   3   4770.3 4798.3
## - VEHICLE_OWNERSHIP    1   4772.4 4804.4
## - POSTAL_CODE          3   4796.7 4824.7
## 
## Step:  AIC=4183.86
## OUTCOME ~ GENDER + DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + 
##     MARRIED + CHILDREN + POSTAL_CODE + ANNUAL_MILEAGE + SPEEDING_VIOLATIONS + 
##     DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - CHILDREN             1   4153.0 4183.0
## - SPEEDING_VIOLATIONS  1   4153.4 4183.4
## - PAST_ACCIDENTS       1   4153.6 4183.6
## <none>                     4151.9 4183.9
## - DUIS                 1   4154.0 4184.0
## - MARRIED              1   4162.8 4192.8
## - ANNUAL_MILEAGE       1   4206.8 4236.8
## - GENDER               1   4301.5 4331.5
## - VEHICLE_YEAR         1   4608.1 4638.1
## - DRIVING_EXPERIENCE   3   4776.8 4802.8
## - POSTAL_CODE          3   4797.7 4823.7
## - VEHICLE_OWNERSHIP    1   4802.0 4832.0
## 
## Step:  AIC=4183.03
## OUTCOME ~ GENDER + DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + 
##     MARRIED + POSTAL_CODE + ANNUAL_MILEAGE + SPEEDING_VIOLATIONS + 
##     DUIS + PAST_ACCIDENTS
## 
##                       Df Deviance    AIC
## - SPEEDING_VIOLATIONS  1   4154.6 4182.6
## - PAST_ACCIDENTS       1   4154.7 4182.7
## <none>                     4153.0 4183.0
## - DUIS                 1   4155.2 4183.2
## - MARRIED              1   4164.4 4192.4
## - ANNUAL_MILEAGE       1   4227.8 4255.8
## - GENDER               1   4302.6 4330.6
## - VEHICLE_YEAR         1   4611.1 4639.1
## - DRIVING_EXPERIENCE   3   4795.7 4819.7
## - POSTAL_CODE          3   4803.3 4827.3
## - VEHICLE_OWNERSHIP    1   4804.0 4832.0
## 
## Step:  AIC=4182.62
## OUTCOME ~ GENDER + DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + 
##     MARRIED + POSTAL_CODE + ANNUAL_MILEAGE + DUIS + PAST_ACCIDENTS
## 
##                      Df Deviance    AIC
## - PAST_ACCIDENTS      1   4156.2 4182.2
## <none>                    4154.6 4182.6
## - DUIS                1   4157.2 4183.2
## - MARRIED             1   4166.1 4192.1
## - ANNUAL_MILEAGE      1   4227.8 4253.8
## - GENDER              1   4314.5 4340.5
## - VEHICLE_YEAR        1   4612.3 4638.3
## - VEHICLE_OWNERSHIP   1   4806.2 4832.2
## - POSTAL_CODE         3   4813.6 4835.6
## - DRIVING_EXPERIENCE  3   5010.4 5032.4
## 
## Step:  AIC=4182.18
## OUTCOME ~ GENDER + DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + VEHICLE_YEAR + 
##     MARRIED + POSTAL_CODE + ANNUAL_MILEAGE + DUIS
## 
##                      Df Deviance    AIC
## <none>                    4156.2 4182.2
## - DUIS                1   4158.9 4182.9
## - MARRIED             1   4167.6 4191.6
## - ANNUAL_MILEAGE      1   4233.7 4257.7
## - GENDER              1   4316.8 4340.8
## - VEHICLE_YEAR        1   4613.7 4637.7
## - VEHICLE_OWNERSHIP   1   4806.4 4830.4
## - POSTAL_CODE         3   4843.0 4863.0
## - DRIVING_EXPERIENCE  3   5592.6 5612.6
# summary model
summary(model_backward)
## 
## Call:
## glm(formula = OUTCOME ~ GENDER + DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + 
##     VEHICLE_YEAR + MARRIED + POSTAL_CODE + ANNUAL_MILEAGE + DUIS, 
##     family = "binomial", data = data_train)
## 
## Coefficients:
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               -2.20441920   0.26226432  -8.405 < 0.0000000000000002
## GENDERmale                 0.99940299   0.08101164  12.337 < 0.0000000000000002
## DRIVING_EXPERIENCE10-19y  -2.14752235   0.09077430 -23.658 < 0.0000000000000002
## DRIVING_EXPERIENCE20-29y  -4.22403836   0.17692269 -23.875 < 0.0000000000000002
## DRIVING_EXPERIENCE30y+    -5.39223245   0.43910017 -12.280 < 0.0000000000000002
## VEHICLE_OWNERSHIP         -2.03681243   0.08606314 -23.666 < 0.0000000000000002
## VEHICLE_YEARbefore 2015    2.06009553   0.10758292  19.149 < 0.0000000000000002
## MARRIED1                  -0.30577182   0.09037675  -3.383             0.000716
## POSTAL_CODE21217          21.37003938 178.13671900   0.120             0.904511
## POSTAL_CODE32765           1.37773592   0.10126934  13.605 < 0.0000000000000002
## POSTAL_CODE92101           1.41774008   0.18702383   7.581   0.0000000000000344
## ANNUAL_MILEAGE             0.00014677   0.00001691   8.679 < 0.0000000000000002
## DUIS                       0.16724977   0.10021863   1.669             0.095147
##                             
## (Intercept)              ***
## GENDERmale               ***
## DRIVING_EXPERIENCE10-19y ***
## DRIVING_EXPERIENCE20-29y ***
## DRIVING_EXPERIENCE30y+   ***
## VEHICLE_OWNERSHIP        ***
## VEHICLE_YEARbefore 2015  ***
## MARRIED1                 ***
## POSTAL_CODE21217            
## POSTAL_CODE32765         ***
## POSTAL_CODE92101         ***
## ANNUAL_MILEAGE           ***
## DUIS                     .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8147.2  on 6518  degrees of freedom
## Residual deviance: 4156.2  on 6506  degrees of freedom
## AIC: 4182.2
## 
## Number of Fisher Scoring iterations: 15
exp(model_backward$coefficients)
##              (Intercept)               GENDERmale DRIVING_EXPERIENCE10-19y 
##               0.11031458               2.71665946               0.11677312 
## DRIVING_EXPERIENCE20-29y   DRIVING_EXPERIENCE30y+        VEHICLE_OWNERSHIP 
##               0.01463941               0.00455180               0.13044385 
##  VEHICLE_YEARbefore 2015                 MARRIED1         POSTAL_CODE21217 
##               7.84671936               0.73655467      1909370388.08310127 
##         POSTAL_CODE32765         POSTAL_CODE92101           ANNUAL_MILEAGE 
##               3.96591233               4.12778145               1.00014678 
##                     DUIS 
##               1.18204946
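
Exponentiated coefficients are odds ratios: the multiplicative change in the odds of a claim for a one-unit increase in a predictor, holding the other predictors fixed. The sketch below turns three coefficients copied from summary(model_backward) into odds ratios and percentage changes. Rescaling ANNUAL_MILEAGE to a step of 1,000 units is an illustrative choice, not something done in the analysis above.

```r
# Coefficients copied from summary(model_backward)
coefs <- c(GENDERmale     =  0.99940299,
           MARRIED1       = -0.30577182,
           ANNUAL_MILEAGE =  0.00014677)

# Odds ratios for a one-unit increase in each predictor
odds_ratio <- exp(coefs)

# Percentage change in the odds of a claim per one-unit increase
pct_change <- (odds_ratio - 1) * 100

# ANNUAL_MILEAGE is per single unit; rescale to an increase of 1,000 units
or_per_1000 <- exp(1000 * coefs[["ANNUAL_MILEAGE"]])

print(round(odds_ratio, 3))
print(round(pct_change, 1))
print(round(or_per_1000, 3))
```

Read this way, male customers have roughly 2.7 times the odds of a claim compared with female customers, married customers have about 26% lower odds, and each additional 1,000 units of annual mileage raises the odds by about 16%. The odds ratio of about 1.9 billion for POSTAL_CODE21217 is not a plausible effect size; it reflects the unstable estimate noted in the coefficient table rather than a real multiplier.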

Model Forward

model_forward <- step(object = model_claim_null,
                      direction = "forward",
                      scope = list(lower = model_claim_null, upper = model_claim))
## Start:  AIC=8149.17
## OUTCOME ~ 1
## 
##                       Df Deviance    AIC
## + DRIVING_EXPERIENCE   3   6143.3 6151.3
## + AGE                  3   6661.3 6669.3
## + INCOME               3   7041.3 7049.3
## + PAST_ACCIDENTS       1   7178.2 7182.2
## + VEHICLE_OWNERSHIP    1   7182.6 7186.6
## + SPEEDING_VIOLATIONS  1   7321.7 7325.7
## + CREDIT_SCORE         1   7475.4 7479.4
## + VEHICLE_YEAR         1   7490.7 7494.7
## + MARRIED              1   7711.1 7715.1
## + CHILDREN             1   7799.2 7803.2
## + DUIS                 1   7848.8 7852.8
## + POSTAL_CODE          3   7855.5 7863.5
## + ANNUAL_MILEAGE       1   7904.9 7908.9
## + EDUCATION            2   7928.7 7934.7
## + GENDER               1   8084.9 8088.9
## + RACE                 1   8144.8 8148.8
## <none>                     8147.2 8149.2
## + VEHICLE_TYPE         1   8147.1 8151.1
## 
## Step:  AIC=6151.32
## OUTCOME ~ DRIVING_EXPERIENCE
## 
##                       Df Deviance    AIC
## + VEHICLE_OWNERSHIP    1   5457.7 5467.7
## + VEHICLE_YEAR         1   5638.8 5648.8
## + POSTAL_CODE          3   5694.3 5708.3
## + INCOME               3   5830.4 5844.4
## + CREDIT_SCORE         1   5950.2 5960.2
## + MARRIED              1   5997.4 6007.4
## + AGE                  3   5993.5 6007.5
## + GENDER               1   6046.3 6056.3
## + ANNUAL_MILEAGE       1   6053.4 6063.4
## + EDUCATION            2   6057.2 6069.2
## + CHILDREN             1   6076.0 6086.0
## + SPEEDING_VIOLATIONS  1   6134.7 6144.7
## + PAST_ACCIDENTS       1   6135.2 6145.2
## <none>                     6143.3 6151.3
## + DUIS                 1   6141.8 6151.8
## + VEHICLE_TYPE         1   6143.2 6153.2
## + RACE                 1   6143.3 6153.3
## 
## Step:  AIC=5467.74
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP
## 
##                       Df Deviance    AIC
## + POSTAL_CODE          3   4938.3 4954.3
## + VEHICLE_YEAR         1   5046.9 5058.9
## + GENDER               1   5343.0 5355.0
## + MARRIED              1   5376.3 5388.3
## + INCOME               3   5382.1 5398.1
## + ANNUAL_MILEAGE       1   5394.7 5406.7
## + AGE                  3   5397.4 5413.4
## + CREDIT_SCORE         1   5410.4 5422.4
## + CHILDREN             1   5411.4 5423.4
## + PAST_ACCIDENTS       1   5446.1 5458.1
## + SPEEDING_VIOLATIONS  1   5446.9 5458.9
## + EDUCATION            2   5445.0 5459.0
## + DUIS                 1   5454.0 5466.0
## <none>                     5457.7 5467.7
## + RACE                 1   5457.3 5469.3
## + VEHICLE_TYPE         1   5457.7 5469.7
## 
## Step:  AIC=4954.26
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE
## 
##                       Df Deviance    AIC
## + VEHICLE_YEAR         1   4463.7 4481.7
## + ANNUAL_MILEAGE       1   4789.0 4807.0
## + GENDER               1   4799.9 4817.9
## + MARRIED              1   4843.0 4861.0
## + INCOME               3   4851.0 4873.0
## + AGE                  3   4873.9 4895.9
## + CHILDREN             1   4880.9 4898.9
## + CREDIT_SCORE         1   4883.1 4901.1
## + EDUCATION            2   4922.8 4942.8
## + DUIS                 1   4931.4 4949.4
## + SPEEDING_VIOLATIONS  1   4935.3 4953.3
## <none>                     4938.3 4954.3
## + PAST_ACCIDENTS       1   4938.0 4956.0
## + RACE                 1   4938.2 4956.2
## + VEHICLE_TYPE         1   4938.2 4956.2
## 
## Step:  AIC=4481.68
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR
## 
##                       Df Deviance    AIC
## + GENDER               1   4307.9 4327.9
## + ANNUAL_MILEAGE       1   4332.0 4352.0
## + MARRIED              1   4400.5 4420.5
## + CHILDREN             1   4420.1 4440.1
## + AGE                  3   4447.9 4471.9
## + DUIS                 1   4457.7 4477.7
## + CREDIT_SCORE         1   4458.5 4478.5
## + SPEEDING_VIOLATIONS  1   4459.0 4479.0
## <none>                     4463.7 4481.7
## + INCOME               3   4457.7 4481.7
## + RACE                 1   4463.0 4483.0
## + VEHICLE_TYPE         1   4463.6 4483.6
## + PAST_ACCIDENTS       1   4463.6 4483.6
## + EDUCATION            2   4463.2 4485.2
## 
## Step:  AIC=4327.89
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER
## 
##                       Df Deviance    AIC
## + ANNUAL_MILEAGE       1   4170.1 4192.1
## + MARRIED              1   4236.5 4258.5
## + CHILDREN             1   4263.2 4285.2
## + AGE                  3   4290.5 4316.5
## + PAST_ACCIDENTS       1   4298.6 4320.6
## + INCOME               3   4299.0 4325.0
## + DUIS                 1   4305.7 4327.7
## <none>                     4307.9 4327.9
## + CREDIT_SCORE         1   4306.3 4328.3
## + RACE                 1   4307.6 4329.6
## + VEHICLE_TYPE         1   4307.7 4329.7
## + SPEEDING_VIOLATIONS  1   4307.7 4329.7
## + EDUCATION            2   4306.2 4330.2
## 
## Step:  AIC=4192.06
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE
## 
##                       Df Deviance    AIC
## + MARRIED              1   4158.9 4182.9
## + DUIS                 1   4167.6 4191.6
## <none>                     4170.1 4192.1
## + SPEEDING_VIOLATIONS  1   4168.2 4192.2
## + PAST_ACCIDENTS       1   4168.4 4192.4
## + CHILDREN             1   4168.5 4192.5
## + RACE                 1   4169.8 4193.8
## + CREDIT_SCORE         1   4169.8 4193.8
## + VEHICLE_TYPE         1   4170.0 4194.0
## + AGE                  3   4167.8 4195.8
## + EDUCATION            2   4169.9 4195.9
## + INCOME               3   4170.0 4198.0
## 
## Step:  AIC=4182.88
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED
## 
##                       Df Deviance    AIC
## + DUIS                 1   4156.2 4182.2
## <none>                     4158.9 4182.9
## + SPEEDING_VIOLATIONS  1   4157.0 4183.0
## + PAST_ACCIDENTS       1   4157.2 4183.2
## + CREDIT_SCORE         1   4157.7 4183.7
## + CHILDREN             1   4157.7 4183.7
## + RACE                 1   4158.5 4184.5
## + VEHICLE_TYPE         1   4158.9 4184.9
## + EDUCATION            2   4158.7 4186.7
## + AGE                  3   4157.8 4187.8
## + INCOME               3   4158.3 4188.3
## 
## Step:  AIC=4182.18
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED + DUIS
## 
##                       Df Deviance    AIC
## <none>                     4156.2 4182.2
## + PAST_ACCIDENTS       1   4154.6 4182.6
## + SPEEDING_VIOLATIONS  1   4154.7 4182.7
## + CREDIT_SCORE         1   4155.0 4183.0
## + CHILDREN             1   4155.1 4183.1
## + RACE                 1   4155.8 4183.8
## + VEHICLE_TYPE         1   4156.2 4184.2
## + EDUCATION            2   4156.0 4186.0
## + AGE                  3   4155.2 4187.2
## + INCOME               3   4155.6 4187.6
# summary model
summary(model_forward)
## 
## Call:
## glm(formula = OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + 
##     POSTAL_CODE + VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED + 
##     DUIS, family = "binomial", data = data_train)
## 
## Coefficients:
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               -2.20441920   0.26226432  -8.405 < 0.0000000000000002
## DRIVING_EXPERIENCE10-19y  -2.14752235   0.09077430 -23.658 < 0.0000000000000002
## DRIVING_EXPERIENCE20-29y  -4.22403836   0.17692269 -23.875 < 0.0000000000000002
## DRIVING_EXPERIENCE30y+    -5.39223245   0.43910017 -12.280 < 0.0000000000000002
## VEHICLE_OWNERSHIP         -2.03681243   0.08606314 -23.666 < 0.0000000000000002
## POSTAL_CODE21217          21.37003938 178.13671900   0.120             0.904511
## POSTAL_CODE32765           1.37773592   0.10126934  13.605 < 0.0000000000000002
## POSTAL_CODE92101           1.41774008   0.18702383   7.581   0.0000000000000344
## VEHICLE_YEARbefore 2015    2.06009553   0.10758292  19.149 < 0.0000000000000002
## GENDERmale                 0.99940299   0.08101164  12.337 < 0.0000000000000002
## ANNUAL_MILEAGE             0.00014677   0.00001691   8.679 < 0.0000000000000002
## MARRIED1                  -0.30577182   0.09037675  -3.383             0.000716
## DUIS                       0.16724977   0.10021863   1.669             0.095147
##                             
## (Intercept)              ***
## DRIVING_EXPERIENCE10-19y ***
## DRIVING_EXPERIENCE20-29y ***
## DRIVING_EXPERIENCE30y+   ***
## VEHICLE_OWNERSHIP        ***
## POSTAL_CODE21217            
## POSTAL_CODE32765         ***
## POSTAL_CODE92101         ***
## VEHICLE_YEARbefore 2015  ***
## GENDERmale               ***
## ANNUAL_MILEAGE           ***
## MARRIED1                 ***
## DUIS                     .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8147.2  on 6518  degrees of freedom
## Residual deviance: 4156.2  on 6506  degrees of freedom
## AIC: 4182.2
## 
## Number of Fisher Scoring iterations: 15
exp(model_forward$coefficients)
##              (Intercept) DRIVING_EXPERIENCE10-19y DRIVING_EXPERIENCE20-29y 
##               0.11031458               0.11677312               0.01463941 
##   DRIVING_EXPERIENCE30y+        VEHICLE_OWNERSHIP         POSTAL_CODE21217 
##               0.00455180               0.13044385      1909370388.08812094 
##         POSTAL_CODE32765         POSTAL_CODE92101  VEHICLE_YEARbefore 2015 
##               3.96591233               4.12778145               7.84671936 
##               GENDERmale           ANNUAL_MILEAGE                 MARRIED1 
##               2.71665946               1.00014678               0.73655467 
##                     DUIS 
##               1.18204946

Model Both

model_both <- step(object = model_claim_null,
                   direction = "both",
                   scope = list(upper = model_claim))
## Start:  AIC=8149.17
## OUTCOME ~ 1
## 
##                       Df Deviance    AIC
## + DRIVING_EXPERIENCE   3   6143.3 6151.3
## + AGE                  3   6661.3 6669.3
## + INCOME               3   7041.3 7049.3
## + PAST_ACCIDENTS       1   7178.2 7182.2
## + VEHICLE_OWNERSHIP    1   7182.6 7186.6
## + SPEEDING_VIOLATIONS  1   7321.7 7325.7
## + CREDIT_SCORE         1   7475.4 7479.4
## + VEHICLE_YEAR         1   7490.7 7494.7
## + MARRIED              1   7711.1 7715.1
## + CHILDREN             1   7799.2 7803.2
## + DUIS                 1   7848.8 7852.8
## + POSTAL_CODE          3   7855.5 7863.5
## + ANNUAL_MILEAGE       1   7904.9 7908.9
## + EDUCATION            2   7928.7 7934.7
## + GENDER               1   8084.9 8088.9
## + RACE                 1   8144.8 8148.8
## <none>                     8147.2 8149.2
## + VEHICLE_TYPE         1   8147.1 8151.1
## 
## Step:  AIC=6151.32
## OUTCOME ~ DRIVING_EXPERIENCE
## 
##                       Df Deviance    AIC
## + VEHICLE_OWNERSHIP    1   5457.7 5467.7
## + VEHICLE_YEAR         1   5638.8 5648.8
## + POSTAL_CODE          3   5694.3 5708.3
## + INCOME               3   5830.4 5844.4
## + CREDIT_SCORE         1   5950.2 5960.2
## + MARRIED              1   5997.4 6007.4
## + AGE                  3   5993.5 6007.5
## + GENDER               1   6046.3 6056.3
## + ANNUAL_MILEAGE       1   6053.4 6063.4
## + EDUCATION            2   6057.2 6069.2
## + CHILDREN             1   6076.0 6086.0
## + SPEEDING_VIOLATIONS  1   6134.7 6144.7
## + PAST_ACCIDENTS       1   6135.2 6145.2
## <none>                     6143.3 6151.3
## + DUIS                 1   6141.8 6151.8
## + VEHICLE_TYPE         1   6143.2 6153.2
## + RACE                 1   6143.3 6153.3
## - DRIVING_EXPERIENCE   3   8147.2 8149.2
## 
## Step:  AIC=5467.74
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP
## 
##                       Df Deviance    AIC
## + POSTAL_CODE          3   4938.3 4954.3
## + VEHICLE_YEAR         1   5046.9 5058.9
## + GENDER               1   5343.0 5355.0
## + MARRIED              1   5376.3 5388.3
## + INCOME               3   5382.1 5398.1
## + ANNUAL_MILEAGE       1   5394.7 5406.7
## + AGE                  3   5397.4 5413.4
## + CREDIT_SCORE         1   5410.4 5422.4
## + CHILDREN             1   5411.4 5423.4
## + PAST_ACCIDENTS       1   5446.1 5458.1
## + SPEEDING_VIOLATIONS  1   5446.9 5458.9
## + EDUCATION            2   5445.0 5459.0
## + DUIS                 1   5454.0 5466.0
## <none>                     5457.7 5467.7
## + RACE                 1   5457.3 5469.3
## + VEHICLE_TYPE         1   5457.7 5469.7
## - VEHICLE_OWNERSHIP    1   6143.3 6151.3
## - DRIVING_EXPERIENCE   3   7182.6 7186.6
## 
## Step:  AIC=4954.26
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE
## 
##                       Df Deviance    AIC
## + VEHICLE_YEAR         1   4463.7 4481.7
## + ANNUAL_MILEAGE       1   4789.0 4807.0
## + GENDER               1   4799.9 4817.9
## + MARRIED              1   4843.0 4861.0
## + INCOME               3   4851.0 4873.0
## + AGE                  3   4873.9 4895.9
## + CHILDREN             1   4880.9 4898.9
## + CREDIT_SCORE         1   4883.1 4901.1
## + EDUCATION            2   4922.8 4942.8
## + DUIS                 1   4931.4 4949.4
## + SPEEDING_VIOLATIONS  1   4935.3 4953.3
## <none>                     4938.3 4954.3
## + PAST_ACCIDENTS       1   4938.0 4956.0
## + RACE                 1   4938.2 4956.2
## + VEHICLE_TYPE         1   4938.2 4956.2
## - POSTAL_CODE          3   5457.7 5467.7
## - VEHICLE_OWNERSHIP    1   5694.3 5708.3
## - DRIVING_EXPERIENCE   3   6839.0 6849.0
## 
## Step:  AIC=4481.68
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR
## 
##                       Df Deviance    AIC
## + GENDER               1   4307.9 4327.9
## + ANNUAL_MILEAGE       1   4332.0 4352.0
## + MARRIED              1   4400.5 4420.5
## + CHILDREN             1   4420.1 4440.1
## + AGE                  3   4447.9 4471.9
## + DUIS                 1   4457.7 4477.7
## + CREDIT_SCORE         1   4458.5 4478.5
## + SPEEDING_VIOLATIONS  1   4459.0 4479.0
## <none>                     4463.7 4481.7
## + INCOME               3   4457.7 4481.7
## + RACE                 1   4463.0 4483.0
## + VEHICLE_TYPE         1   4463.6 4483.6
## + PAST_ACCIDENTS       1   4463.6 4483.6
## + EDUCATION            2   4463.2 4485.2
## - VEHICLE_YEAR         1   4938.3 4954.3
## - POSTAL_CODE          3   5046.9 5058.9
## - VEHICLE_OWNERSHIP    1   5127.6 5143.6
## - DRIVING_EXPERIENCE   3   6287.6 6299.6
## 
## Step:  AIC=4327.89
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER
## 
##                       Df Deviance    AIC
## + ANNUAL_MILEAGE       1   4170.1 4192.1
## + MARRIED              1   4236.5 4258.5
## + CHILDREN             1   4263.2 4285.2
## + AGE                  3   4290.5 4316.5
## + PAST_ACCIDENTS       1   4298.6 4320.6
## + INCOME               3   4299.0 4325.0
## + DUIS                 1   4305.7 4327.7
## <none>                     4307.9 4327.9
## + CREDIT_SCORE         1   4306.3 4328.3
## + RACE                 1   4307.6 4329.6
## + VEHICLE_TYPE         1   4307.7 4329.7
## + SPEEDING_VIOLATIONS  1   4307.7 4329.7
## + EDUCATION            2   4306.2 4330.2
## - GENDER               1   4463.7 4481.7
## - VEHICLE_YEAR         1   4799.9 4817.9
## - POSTAL_CODE          3   4920.9 4934.9
## - VEHICLE_OWNERSHIP    1   4999.0 5017.0
## - DRIVING_EXPERIENCE   3   6183.3 6197.3
## 
## Step:  AIC=4192.06
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE
## 
##                       Df Deviance    AIC
## + MARRIED              1   4158.9 4182.9
## + DUIS                 1   4167.6 4191.6
## <none>                     4170.1 4192.1
## + SPEEDING_VIOLATIONS  1   4168.2 4192.2
## + PAST_ACCIDENTS       1   4168.4 4192.4
## + CHILDREN             1   4168.5 4192.5
## + RACE                 1   4169.8 4193.8
## + CREDIT_SCORE         1   4169.8 4193.8
## + VEHICLE_TYPE         1   4170.0 4194.0
## + AGE                  3   4167.8 4195.8
## + EDUCATION            2   4169.9 4195.9
## + INCOME               3   4170.0 4198.0
## - ANNUAL_MILEAGE       1   4307.9 4327.9
## - GENDER               1   4332.0 4352.0
## - VEHICLE_YEAR         1   4643.2 4663.2
## - VEHICLE_OWNERSHIP    1   4841.0 4861.0
## - POSTAL_CODE          3   4869.2 4885.2
## - DRIVING_EXPERIENCE   3   5890.5 5906.5
## 
## Step:  AIC=4182.88
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED
## 
##                       Df Deviance    AIC
## + DUIS                 1   4156.2 4182.2
## <none>                     4158.9 4182.9
## + SPEEDING_VIOLATIONS  1   4157.0 4183.0
## + PAST_ACCIDENTS       1   4157.2 4183.2
## + CREDIT_SCORE         1   4157.7 4183.7
## + CHILDREN             1   4157.7 4183.7
## + RACE                 1   4158.5 4184.5
## + VEHICLE_TYPE         1   4158.9 4184.9
## + EDUCATION            2   4158.7 4186.7
## + AGE                  3   4157.8 4187.8
## + INCOME               3   4158.3 4188.3
## - MARRIED              1   4170.1 4192.1
## - ANNUAL_MILEAGE       1   4236.5 4258.5
## - GENDER               1   4323.5 4345.5
## - VEHICLE_YEAR         1   4617.3 4639.3
## - VEHICLE_OWNERSHIP    1   4806.9 4828.9
## - POSTAL_CODE          3   4844.4 4862.4
## - DRIVING_EXPERIENCE   3   5812.5 5830.5
## 
## Step:  AIC=4182.18
## OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + POSTAL_CODE + 
##     VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED + DUIS
## 
##                       Df Deviance    AIC
## <none>                     4156.2 4182.2
## + PAST_ACCIDENTS       1   4154.6 4182.6
## + SPEEDING_VIOLATIONS  1   4154.7 4182.7
## - DUIS                 1   4158.9 4182.9
## + CREDIT_SCORE         1   4155.0 4183.0
## + CHILDREN             1   4155.1 4183.1
## + RACE                 1   4155.8 4183.8
## + VEHICLE_TYPE         1   4156.2 4184.2
## + EDUCATION            2   4156.0 4186.0
## + AGE                  3   4155.2 4187.2
## + INCOME               3   4155.6 4187.6
## - MARRIED              1   4167.6 4191.6
## - ANNUAL_MILEAGE       1   4233.7 4257.7
## - GENDER               1   4316.8 4340.8
## - VEHICLE_YEAR         1   4613.7 4637.7
## - VEHICLE_OWNERSHIP    1   4806.4 4830.4
## - POSTAL_CODE          3   4843.0 4863.0
## - DRIVING_EXPERIENCE   3   5592.6 5612.6
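
The AIC column in the trace above can be reproduced by hand from the Deviance column. The sketch below assumes ungrouped 0/1 responses, where the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The helper name `aic_from_deviance` is hypothetical and only used for illustration.

```r
# AIC for a binomial glm on 0/1 data:
# residual deviance + 2 * (number of estimated coefficients)
aic_from_deviance <- function(deviance, n_coef) {
  deviance + 2 * n_coef
}

# Deviances copied from the step() trace above
aic_null  <- aic_from_deviance(8147.2, 1)   # intercept only
aic_step1 <- aic_from_deviance(6143.3, 4)   # intercept + 3 DRIVING_EXPERIENCE dummies
aic_final <- aic_from_deviance(4156.2, 13)  # final model with 13 coefficients

# Gives 8149.2, 6151.3 and 4182.2, matching the trace
round(c(null = aic_null, step1 = aic_step1, final = aic_final), 1)
```

This also shows why step() stops: adding any remaining predictor lowers the deviance by less than the penalty of 2 per extra coefficient.
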
# summary model
summary(model_both)
## 
## Call:
## glm(formula = OUTCOME ~ DRIVING_EXPERIENCE + VEHICLE_OWNERSHIP + 
##     POSTAL_CODE + VEHICLE_YEAR + GENDER + ANNUAL_MILEAGE + MARRIED + 
##     DUIS, family = "binomial", data = data_train)
## 
## Coefficients:
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               -2.20441920   0.26226432  -8.405 < 0.0000000000000002
## DRIVING_EXPERIENCE10-19y  -2.14752235   0.09077430 -23.658 < 0.0000000000000002
## DRIVING_EXPERIENCE20-29y  -4.22403836   0.17692269 -23.875 < 0.0000000000000002
## DRIVING_EXPERIENCE30y+    -5.39223245   0.43910017 -12.280 < 0.0000000000000002
## VEHICLE_OWNERSHIP         -2.03681243   0.08606314 -23.666 < 0.0000000000000002
## POSTAL_CODE21217          21.37003938 178.13671900   0.120             0.904511
## POSTAL_CODE32765           1.37773592   0.10126934  13.605 < 0.0000000000000002
## POSTAL_CODE92101           1.41774008   0.18702383   7.581   0.0000000000000344
## VEHICLE_YEARbefore 2015    2.06009553   0.10758292  19.149 < 0.0000000000000002
## GENDERmale                 0.99940299   0.08101164  12.337 < 0.0000000000000002
## ANNUAL_MILEAGE             0.00014677   0.00001691   8.679 < 0.0000000000000002
## MARRIED1                  -0.30577182   0.09037675  -3.383             0.000716
## DUIS                       0.16724977   0.10021863   1.669             0.095147
##                             
## (Intercept)              ***
## DRIVING_EXPERIENCE10-19y ***
## DRIVING_EXPERIENCE20-29y ***
## DRIVING_EXPERIENCE30y+   ***
## VEHICLE_OWNERSHIP        ***
## POSTAL_CODE21217            
## POSTAL_CODE32765         ***
## POSTAL_CODE92101         ***
## VEHICLE_YEARbefore 2015  ***
## GENDERmale               ***
## ANNUAL_MILEAGE           ***
## MARRIED1                 ***
## DUIS                     .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8147.2  on 6518  degrees of freedom
## Residual deviance: 4156.2  on 6506  degrees of freedom
## AIC: 4182.2
## 
## Number of Fisher Scoring iterations: 15
exp(model_both$coefficients)
##              (Intercept) DRIVING_EXPERIENCE10-19y DRIVING_EXPERIENCE20-29y 
##               0.11031458               0.11677312               0.01463941 
##   DRIVING_EXPERIENCE30y+        VEHICLE_OWNERSHIP         POSTAL_CODE21217 
##               0.00455180               0.13044385      1909370388.08812094 
##         POSTAL_CODE32765         POSTAL_CODE92101  VEHICLE_YEARbefore 2015 
##               3.96591233               4.12778145               7.84671936 
##               GENDERmale           ANNUAL_MILEAGE                 MARRIED1 
##               2.71665946               1.00014678               0.73655467 
##                     DUIS 
##               1.18204946
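
The exponentiated coefficients above are odds ratios: each one is the multiplicative change in the odds of OUTCOME = 1 for a one-unit increase in that predictor, holding the others fixed. The sketch below works through three of them using estimates copied from the summary. The 1,000-unit rescaling of ANNUAL_MILEAGE is an illustrative choice, not part of the original analysis. The very large value for POSTAL_CODE21217 comes with a standard error of about 178, which often indicates that this level is almost perfectly separated in the training data, so it is best not read as a real effect size.

```r
# Log-odds estimates copied from summary(model_both) above
log_odds <- c(GENDERmale     = 0.99940299,
              MARRIED1       = -0.30577182,
              ANNUAL_MILEAGE = 0.00014677)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 4)

# Percentage change in the odds of OUTCOME = 1 per one-unit increase
pct_change <- (odds_ratio - 1) * 100
round(pct_change, 1)

# A one-unit change in ANNUAL_MILEAGE is tiny, so a 1,000-unit
# increase is often easier to interpret
or_per_1000 <- exp(1000 * log_odds[["ANNUAL_MILEAGE"]])
round(or_per_1000, 3)
```

Read this way, the odds of a claim for male drivers are roughly 2.7 times those for female drivers, and being married lowers the odds by about 26%.
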

COMPARISON

comparison <- compare_performance(model_claim,
                                  model_claim_null,
                                  model_backward,
                                  model_forward,
                                  model_both)

comparison

💡 Insight:

Good Model Criteria:
* a higher R-squared value (for a binomial glm, compare_performance() reports Tjur's R2 rather than adjusted R-squared)
* a lower AIC score
* a lower RMSE score
* judged on these three criteria, model_backward, model_forward and model_both all perform well
* we choose model_backward to predict data_test

PREDICTION

library(gtools)
## 
## Attaching package: 'gtools'
## The following object is masked from 'package:car':
## 
##     logit
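# predicted log-odds for each row of the test set
log_claim <- predict(object = model_backward,
                     newdata = data_test,
                     type = "link")

# convert log-odds to probabilities, then classify with a 0.5 threshold
p_claim <- inv.logit(log_claim)
claim_test_pred <- ifelse(test = p_claim >= 0.5,
                          yes = 1,
                          no = 0)
data_test$OUTCOME_PRED <- claim_test_pred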
data_test %>% 
  select(OUTCOME,OUTCOME_PRED)
confusionMatrix(data = as.factor(data_test$OUTCOME_PRED),
                reference = as.factor(data_test$OUTCOME),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1053  127
##          1  110  340
##                                              
##                Accuracy : 0.8546             
##                  95% CI : (0.8365, 0.8714)   
##     No Information Rate : 0.7135             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6404             
##                                              
##  Mcnemar's Test P-Value : 0.2987             
##                                              
##             Sensitivity : 0.7281             
##             Specificity : 0.9054             
##          Pos Pred Value : 0.7556             
##          Neg Pred Value : 0.8924             
##              Prevalence : 0.2865             
##          Detection Rate : 0.2086             
##    Detection Prevalence : 0.2761             
##       Balanced Accuracy : 0.8167             
##                                              
##        'Positive' Class : 1                  
## 
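
The prediction step above takes log-odds from predict() and converts them with inv.logit(). As a minimal sketch, the same probabilities can be obtained with base R alone, either with plogis() or by asking predict() for type = "response". The example uses a toy model on the built-in mtcars data, not the insurance data, and the object names are hypothetical.

```r
# Toy logistic model on the built-in mtcars data (not the insurance data)
toy_model <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# Route 1: log-odds from the model, then the inverse logit
log_odds <- predict(toy_model, newdata = mtcars, type = "link")
p_manual <- plogis(log_odds)   # base R inverse logit: 1 / (1 + exp(-x))

# Route 2: ask predict() for probabilities directly
p_direct <- predict(toy_model, newdata = mtcars, type = "response")

# Both routes give the same probabilities
all.equal(p_manual, p_direct)

# Classify with the same 0.5 cut-off used above
pred_class <- ifelse(p_direct >= 0.5, 1, 0)
table(Predicted = pred_class, Actual = mtcars$am)
```

Using plogis() avoids loading gtools, and with it the masking of car::logit shown in the message above.
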

EVALUATION

💡 Insight:

  • from the confusion matrix, the model predicts the OUTCOME class with an accuracy of 85.46%
  • in this case, we assume that RECALL is the metric to focus on: the company would like to reduce the number of claims it pays, so missing a customer who will claim (a false negative) is the costlier error
  • the RECALL (SENSITIVITY) of this model is 72.81%
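
The figures quoted above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand, using counts copied from the logistic regression output with 1 as the positive class.

```r
# Counts copied from the confusion matrix above (positive class = 1)
tp <- 340   # predicted 1, actual 1
fn <- 127   # predicted 0, actual 1
fp <- 110   # predicted 1, actual 0
tn <- 1053  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
recall      <- tp / (tp + fn)   # sensitivity
precision   <- tp / (tp + fp)   # positive predictive value
specificity <- tn / (tn + fp)

# Gives 0.8546, 0.7281, 0.7556 and 0.9054, matching the output above
round(c(accuracy = accuracy, recall = recall,
        precision = precision, specificity = specificity), 4)
```

Recall depends only on the actual positives (tp and fn), so it is the metric that directly counts how many real claims the model misses.
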

MODELING

KNN

knn_set <- dataset %>% 
  select_if(is.numeric) %>% 
  select(-ID) %>% 
  mutate(OUTCOME = as.factor(OUTCOME)) %>% 
  drop_na()

table(knn_set$OUTCOME) %>% 
  prop.table()
## 
##         0         1 
## 0.6887962 0.3112038
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

# index sampling
index <- sample(x = nrow(knn_set), 
                size = 0.8*nrow(knn_set)) # take 80% of the rows for training; the rest become the test set

# splitting
knn_train <- knn_set[index,] # rows whose position is in index
knn_test <- knn_set[-index,] # rows whose position is not in index

# separating predictor and target
knn_train_x <- knn_train %>% select_if(is.numeric)
knn_test_x <- knn_test %>% select_if(is.numeric)

knn_train_y <- knn_train[,"OUTCOME"]
knn_test_y <- knn_test[, "OUTCOME"]

# scaling data train and test
knn_train_xs <- scale(x = knn_train_x)
knn_test_xs <- scale(x = knn_test_x,
                     center = attr(knn_train_xs,"scaled:center"),
                     scale = attr(knn_train_xs,"scaled:scale"))
# choosing k with the square-root rule of thumb (a heuristic starting point, not a tuned optimum)
k_opt <- round(sqrt(nrow(knn_set)))

# prediction
knn_pred <- knn(train = knn_train_xs,
                 test = knn_test_xs,
                 cl = knn_train_y,
                 k = k_opt)

confusionMatrix(data = knn_pred,        
                reference = knn_test_y, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 992 209
##          1 148 281
##                                              
##                Accuracy : 0.781              
##                  95% CI : (0.7601, 0.8008)   
##     No Information Rate : 0.6994             
##     P-Value [Acc > NIR] : 0.00000000000008357
##                                              
##                   Kappa : 0.46               
##                                              
##  Mcnemar's Test P-Value : 0.001496           
##                                              
##             Sensitivity : 0.5735             
##             Specificity : 0.8702             
##          Pos Pred Value : 0.6550             
##          Neg Pred Value : 0.8260             
##              Prevalence : 0.3006             
##          Detection Rate : 0.1724             
##    Detection Prevalence : 0.2632             
##       Balanced Accuracy : 0.7218             
##                                              
##        'Positive' Class : 1                  
## 
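
Two details of the KNN preparation above are worth illustrating: the test set is scaled with the center and scale values learned from the training set, and k comes from the square-root rule of thumb. The sketch below shows both on toy data using base R only. The data frame, its column names and its values are hypothetical.

```r
set.seed(100)

# Toy numeric data standing in for the predictors (hypothetical values)
toy <- data.frame(mileage = rnorm(50, mean = 12000, sd = 2800),
                  score   = runif(50))
idx <- sample(nrow(toy), size = 0.8 * nrow(toy))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Learn centering and scaling values from the training rows only
train_s <- scale(toy_train)
ctr <- attr(train_s, "scaled:center")
scl <- attr(train_s, "scaled:scale")

# Apply the training parameters to the test rows
test_s <- scale(toy_test, center = ctr, scale = scl)

# Training columns have mean 0 and sd 1 by construction;
# test columns generally do not, because they reuse the training parameters
round(colMeans(train_s), 10)
round(apply(train_s, 2, sd), 10)
round(colMeans(test_s), 3)

# Rule-of-thumb k: square root of the number of rows, rounded to an integer
k_rule <- round(sqrt(nrow(toy)))
k_rule
```

Reusing the training parameters keeps information about the test rows out of the preprocessing, so the test set behaves like genuinely unseen data.
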

💡 Insight:

  • from the confusion matrix, the model predicts the OUTCOME class with an accuracy of 78.10%
  • as with the logistic regression, we focus on RECALL because the company would like to reduce the number of claims it pays
  • the RECALL (SENSITIVITY) of this model is 57.35%

SUMMARY

Performance Metric            Log Reg   KNN
Accuracy                      0.8546    0.7810
Recall (Sensitivity)          0.7281    0.5735
Precision (Pos Pred Value)    0.7556    0.6550
Specificity                   0.9054    0.8702

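Based on these results, the Logistic Regression model performs better than the KNN model on every metric shown. One possible reason is that the Logistic Regression model uses more predictors: the KNN model is built only on the numeric columns and ignores the categorical ones. The two models were also evaluated on different test splits (prevalence 0.2865 versus 0.3006), so the comparison is indicative rather than exact.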
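
The KNN model above uses only numeric columns. One way categorical predictors could be brought in, sketched here as an assumption rather than something done in this analysis, is to expand them into 0/1 dummy columns with model.matrix() before scaling. The toy rows below are hypothetical and only mimic two of the dataset's factor columns.

```r
# Toy rows mimicking two categorical columns of the dataset (hypothetical values)
toy <- data.frame(
  GENDER       = factor(c("female", "male", "male", "female")),
  VEHICLE_YEAR = factor(c("after 2015", "before 2015", "before 2015", "after 2015")),
  CREDIT_SCORE = c(0.63, 0.36, 0.49, 0.21)
)

# model.matrix() expands factors into 0/1 dummy columns.
# "- 1" drops the intercept, so the first factor keeps a column per level
# while later factors keep one column fewer than their number of levels
toy_x <- model.matrix(~ . - 1, data = toy)
toy_x
```

The resulting numeric matrix could then be scaled and passed to knn() in the same way as knn_train_x, although distances on dummy columns behave differently from distances on continuous ones.
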

A work by Taufan Anggoro Adhi

tf.anggoro@gmail.com