The aim of this challenge is to predict the probability that a driver will initiate an auto insurance claim in the next year. These predictions will allow the insurer to further tailor its prices, and hopefully make auto insurance coverage more accessible to more drivers. This competition was one of the most popular on Kaggle in the featured category.
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
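For illustration, the group and type encoded in a feature name can be recovered by simple pattern matching; a small base-R sketch over a few example names (a hypothetical helper, not part of the analysis below):
# Parse group (ind, reg, car, calc) and type from feature names
feats <- c("ps_ind_06_bin", "ps_car_01_cat", "ps_reg_03", "ps_calc_10")
data.frame(feature = feats,
           group = sub("^ps_([a-z]+)_.*$", "\\1", feats),
           type = ifelse(grepl("_bin$", feats), "binary",
                         ifelse(grepl("_cat$", feats), "categorical",
                                "continuous/ordinal")))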
# Required libraries
library(tidyverse)
library(data.table)
library(caret)
library(gridExtra)
library(plotly)
library(corrplot)
library(DT)
# Reading data files, replacing -1 with NA
train <- fread("train.csv", na = c("-1","-1.0"))
test <- fread("test.csv", na = c("-1","-1.0"))
It is clear from the variable target that the distribution of class labels is highly imbalanced, with the positive class amounting to only 3.645 percent of observations.
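A quick tabulation of target (using the train table read above) confirms this imbalance:
# Class distribution of the target variable
train %>% count(target) %>% mutate(percentage = 100 * n / sum(n))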
summary(train)
## id target ps_ind_01 ps_ind_02_cat
## Min. : 7 Min. :0.00000 Min. :0.0 Min. :1.00
## 1st Qu.: 371992 1st Qu.:0.00000 1st Qu.:0.0 1st Qu.:1.00
## Median : 743548 Median :0.00000 Median :1.0 Median :1.00
## Mean : 743804 Mean :0.03645 Mean :1.9 Mean :1.36
## 3rd Qu.:1115549 3rd Qu.:0.00000 3rd Qu.:3.0 3rd Qu.:2.00
## Max. :1488027 Max. :1.00000 Max. :7.0 Max. :4.00
## NA's :216
## ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin
## Min. : 0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median : 4.000 Median :0.000 Median :0.000 Median :0.0000
## Mean : 4.423 Mean :0.417 Mean :0.419 Mean :0.3937
## 3rd Qu.: 6.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :11.000 Max. :1.000 Max. :6.000 Max. :1.0000
## NA's :83 NA's :5809
## ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.000000
## Mean :0.257 Mean :0.1639 Mean :0.1853 Mean :0.000373
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000000
##
## ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.001692 Mean :0.009439 Mean :0.0009476
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
##
## ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin
## Min. :0.00000 Min. : 0.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 5.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median : 7.0 Median :1.0000 Median :0.0000
## Mean :0.01245 Mean : 7.3 Mean :0.6608 Mean :0.1211
## 3rd Qu.:0.00000 3rd Qu.:10.0 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4.00000 Max. :13.0 Max. :1.0000 Max. :1.0000
##
## ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.06
## 1st Qu.:0.0000 1st Qu.:0.400 1st Qu.:0.2000 1st Qu.:0.63
## Median :0.0000 Median :0.700 Median :0.3000 Median :0.80
## Mean :0.1534 Mean :0.611 Mean :0.4392 Mean :0.89
## 3rd Qu.:0.0000 3rd Qu.:0.900 3rd Qu.:0.6000 3rd Qu.:1.08
## Max. :1.0000 Max. :0.900 Max. :1.8000 Max. :4.04
## NA's :107772
## ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat
## Min. : 0.000 Min. :0.0000 Min. :0.0 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:1.0000 1st Qu.:0.0 1st Qu.:0.0000
## Median : 7.000 Median :1.0000 Median :1.0 Median :0.0000
## Mean : 8.298 Mean :0.8299 Mean :0.6 Mean :0.7252
## 3rd Qu.:11.000 3rd Qu.:1.0000 3rd Qu.:1.0 3rd Qu.:0.0000
## Max. :11.000 Max. :1.0000 Max. :1.0 Max. :9.0000
## NA's :107 NA's :5 NA's :411231
## ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat
## Min. :0.00 Min. : 0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.0000
## Median :1.00 Median : 7.000 Median :1.000 Median :1.0000
## Mean :0.53 Mean : 6.555 Mean :0.948 Mean :0.8321
## 3rd Qu.:1.00 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.00 Max. :17.000 Max. :1.000 Max. :1.0000
## NA's :266551 NA's :11489
## ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11
## Min. :0.000 Min. :0.0000 Min. : 1.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.: 32.00 1st Qu.:2.000
## Median :2.000 Median :1.0000 Median : 65.00 Median :3.000
## Mean :1.331 Mean :0.9921 Mean : 62.22 Mean :2.346
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 93.00 3rd Qu.:3.000
## Max. :4.000 Max. :2.0000 Max. :104.00 Max. :3.000
## NA's :569 NA's :5
## ps_car_12 ps_car_13 ps_car_14 ps_car_15
## Min. :0.1000 Min. :0.2506 Min. :0.11 Min. :0.000
## 1st Qu.:0.3162 1st Qu.:0.6709 1st Qu.:0.35 1st Qu.:2.828
## Median :0.3742 Median :0.7658 Median :0.37 Median :3.317
## Mean :0.3799 Mean :0.8133 Mean :0.37 Mean :3.066
## 3rd Qu.:0.4000 3rd Qu.:0.9062 3rd Qu.:0.40 3rd Qu.:3.606
## Max. :1.2649 Max. :3.7206 Max. :0.64 Max. :3.742
## NA's :1 NA's :42620
## ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:2.000
## Median :0.5000 Median :0.4000 Median :0.5000 Median :2.000
## Mean :0.4498 Mean :0.4496 Mean :0.4498 Mean :2.372
## 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:3.000
## Max. :0.9000 Max. :0.9000 Max. :0.9000 Max. :5.000
##
## ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08
## Min. :0.000 Min. : 0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:1.000 1st Qu.: 7.000 1st Qu.:2.000 1st Qu.: 8.000
## Median :2.000 Median : 8.000 Median :3.000 Median : 9.000
## Mean :1.886 Mean : 7.689 Mean :3.006 Mean : 9.226
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :6.000 Max. :10.000 Max. :9.000 Max. :12.000
##
## ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.: 4.000 1st Qu.: 1.000
## Median :2.000 Median : 8.000 Median : 5.000 Median : 1.000
## Mean :2.339 Mean : 8.434 Mean : 5.441 Mean : 1.442
## 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 2.000
## Max. :7.000 Max. :25.000 Max. :19.000 Max. :10.000
##
## ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: 6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median : 7.000 Median :0.0000 Median :1.0000
## Mean : 2.872 Mean : 7.539 Mean :0.1224 Mean :0.6278
## 3rd Qu.: 4.000 3rd Qu.: 9.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :13.000 Max. :23.000 Max. :1.0000 Max. :1.0000
##
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.5542 Mean :0.2872 Mean :0.349 Mean :0.1533
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
##
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01 <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03 <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01 <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02 <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03 <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11 <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12 <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14 <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15 <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01 <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02 <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03 <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04 <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05 <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06 <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07 <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08 <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09 <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10 <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11 <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12 <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13 <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14 <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
The summary of test data is similar to that of train.
summary(test)
## id ps_ind_01 ps_ind_02_cat ps_ind_03
## Min. : 0 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 372022 1st Qu.:0.000 1st Qu.:1.000 1st Qu.: 2.000
## Median : 744307 Median :1.000 Median :1.000 Median : 4.000
## Mean : 744154 Mean :1.902 Mean :1.359 Mean : 4.414
## 3rd Qu.:1116309 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 6.000
## Max. :1488026 Max. :7.000 Max. :4.000 Max. :11.000
## NA's :307
## ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.4176 Mean :0.422 Mean :0.3932 Mean :0.2572
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.000 Max. :1.0000 Max. :1.0000
## NA's :145 NA's :8710
## ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.0000 Median :0.0000 Median :0.000000 Median :0.000000
## Mean :0.1637 Mean :0.1859 Mean :0.000373 Mean :0.001595
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.000000
##
## ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. : 0.000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.: 5.000
## Median :0.000000 Median :0.000000 Median :0.00000 Median : 7.000
## Mean :0.009376 Mean :0.001039 Mean :0.01238 Mean : 7.297
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:10.000
## Max. :1.000000 Max. :1.000000 Max. :4.00000 Max. :13.000
##
## ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.4000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.7000
## Mean :0.6606 Mean :0.1204 Mean :0.155 Mean :0.6111
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.9000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :0.9000
##
## ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat
## Min. :0.0000 Min. :0.06 Min. : 0.000 Min. :0.00
## 1st Qu.:0.2000 1st Qu.:0.63 1st Qu.: 7.000 1st Qu.:1.00
## Median :0.3000 Median :0.80 Median : 7.000 Median :1.00
## Mean :0.4399 Mean :0.89 Mean : 8.294 Mean :0.83
## 3rd Qu.:0.6000 3rd Qu.:1.09 3rd Qu.:11.000 3rd Qu.:1.00
## Max. :1.8000 Max. :4.42 Max. :11.000 Max. :1.00
## NA's :161684 NA's :160 NA's :5
## ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat
## Min. :0.0 Min. :0.0000 Min. :0.0 Min. : 0.000
## 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0 1st Qu.: 1.000
## Median :1.0 Median :0.0000 Median :1.0 Median : 7.000
## Mean :0.6 Mean :0.7258 Mean :0.5 Mean : 6.564
## 3rd Qu.:1.0 3rd Qu.:0.0000 3rd Qu.:1.0 3rd Qu.:11.000
## Max. :1.0 Max. :9.0000 Max. :1.0 Max. :17.000
## NA's :616911 NA's :400359
## ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.00 1st Qu.:1.0000
## Median :1.000 Median :1.0000 Median :2.00 Median :1.0000
## Mean :0.948 Mean :0.8323 Mean :1.33 Mean :0.9921
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :4.00 Max. :2.0000
## NA's :17331 NA's :877
## ps_car_11_cat ps_car_11 ps_car_12 ps_car_13
## Min. : 1.00 Min. :0.000 Min. :0.1414 Min. :0.2758
## 1st Qu.: 32.00 1st Qu.:2.000 1st Qu.:0.3162 1st Qu.:0.6712
## Median : 65.00 Median :3.000 Median :0.3742 Median :0.7661
## Mean : 62.28 Mean :2.347 Mean :0.3800 Mean :0.8136
## 3rd Qu.: 94.00 3rd Qu.:3.000 3rd Qu.:0.4000 3rd Qu.:0.9061
## Max. :104.00 Max. :3.000 Max. :1.2649 Max. :4.0313
## NA's :1
## ps_car_14 ps_car_15 ps_calc_01 ps_calc_02
## Min. :0.11 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.35 1st Qu.:2.828 1st Qu.:0.2000 1st Qu.:0.2000
## Median :0.37 Median :3.317 Median :0.4000 Median :0.5000
## Mean :0.37 Mean :3.068 Mean :0.4496 Mean :0.4505
## 3rd Qu.:0.40 3rd Qu.:3.606 3rd Qu.:0.7000 3rd Qu.:0.7000
## Max. :0.64 Max. :3.742 Max. :0.9000 Max. :0.9000
## NA's :63805
## ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 1.000
## 1st Qu.:0.2000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 7.000
## Median :0.4000 Median :2.000 Median :2.000 Median : 8.000
## Mean :0.4501 Mean :2.371 Mean :1.885 Mean : 7.688
## 3rd Qu.:0.7000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :0.9000 Max. :5.000 Max. :6.000 Max. :10.000
##
## ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10
## Min. :0.00 Min. : 1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:2.00 1st Qu.: 8.000 1st Qu.:1.000 1st Qu.: 6.000
## Median :3.00 Median : 9.000 Median :2.000 Median : 8.000
## Mean :3.01 Mean : 9.226 Mean :2.339 Mean : 8.443
## 3rd Qu.:4.00 3rd Qu.:10.000 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :9.00 Max. :12.000 Max. :7.000 Max. :25.000
##
## ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median : 5.000 Median : 1.00 Median : 3.000 Median : 7.00
## Mean : 5.438 Mean : 1.44 Mean : 2.875 Mean : 7.54
## 3rd Qu.: 7.000 3rd Qu.: 2.00 3rd Qu.: 4.000 3rd Qu.: 9.00
## Max. :20.000 Max. :11.00 Max. :15.000 Max. :28.00
##
## ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.1237 Mean :0.6278 Mean :0.5547 Mean :0.2878
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3493 Mean :0.1524
## 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
The test data has around 1.5 times as many rows as the train data.
glimpse(test)
## Observations: 892,816
## Variables: 58
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 18,...
## $ ps_ind_01 <int> 0, 4, 5, 0, 5, 0, 0, 0, 0, 1, 0, 1, 1, 3, 0, 2,...
## $ ps_ind_02_cat <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,...
## $ ps_ind_03 <int> 8, 5, 3, 6, 7, 6, 3, 0, 7, 6, 5, 4, 2, 3, 1, 2,...
## $ ps_ind_04_cat <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 6, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ps_ind_08_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ ps_ind_09_bin <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 12, 5, 10, 4, 4, 10, 11, 7, 6, 7, 3, 9, 8, 0, 8...
## $ ps_ind_16_bin <int> 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ ps_ind_17_bin <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_reg_01 <dbl> 0.5, 0.9, 0.4, 0.1, 0.9, 0.9, 0.1, 0.9, 0.4, 0....
## $ ps_reg_02 <dbl> 0.3, 0.5, 0.0, 0.2, 0.4, 0.5, 0.1, 1.1, 0.0, 1....
## $ ps_reg_03 <dbl> 0.6103278, 0.7713624, 0.9161741, NA, 0.8177714,...
## $ ps_car_01_cat <int> 7, 4, 11, 7, 11, 9, 6, 7, 11, 11, 11, 11, 11, 1...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2,...
## $ ps_car_05_cat <int> NA, 0, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 1, 11, 14, 1, 11, 11, 1, 11, 2, 4, 11, 7, 6, 1,...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1...
## $ ps_car_08_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,...
## $ ps_car_09_cat <int> 2, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 2,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 65, 103, 29, 40, 101, 11, 10, 103, 104, 104, 10...
## $ ps_car_11 <int> 1, 1, 3, 2, 3, 2, 2, 3, 2, 2, 3, 3, 2, 1, 2, 0,...
## $ ps_car_12 <dbl> 0.3162278, 0.3162278, 0.4000000, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.6695564, 0.6063200, 0.8962387, 0.6521104, 0.8...
## $ ps_car_14 <dbl> 0.3521363, 0.3583295, 0.3984972, 0.3814446, 0.3...
## $ ps_car_15 <dbl> 3.464102, 2.828427, 3.316625, 2.449490, 3.31662...
## $ ps_calc_01 <dbl> 0.1, 0.4, 0.6, 0.1, 0.9, 0.7, 0.9, 0.8, 0.9, 0....
## $ ps_calc_02 <dbl> 0.8, 0.5, 0.6, 0.5, 0.6, 0.9, 0.8, 0.9, 0.3, 0....
## $ ps_calc_03 <dbl> 0.6, 0.4, 0.6, 0.5, 0.8, 0.4, 0.8, 0.5, 0.0, 0....
## $ ps_calc_04 <int> 1, 3, 2, 2, 3, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 1,...
## $ ps_calc_05 <int> 1, 3, 3, 1, 4, 1, 1, 2, 2, 1, 2, 2, 1, 2, 3, 4,...
## $ ps_calc_06 <int> 6, 8, 7, 7, 7, 9, 7, 8, 9, 7, 5, 9, 8, 7, 7, 8,...
## $ ps_calc_07 <int> 3, 4, 4, 3, 1, 5, 3, 4, 7, 1, 5, 3, 0, 1, 3, 3,...
## $ ps_calc_08 <int> 6, 10, 6, 12, 10, 9, 9, 11, 9, 9, 7, 10, 9, 9, ...
## $ ps_calc_09 <int> 2, 2, 3, 1, 4, 4, 5, 2, 0, 1, 2, 1, 4, 1, 4, 3,...
## $ ps_calc_10 <int> 9, 7, 12, 13, 12, 12, 6, 8, 10, 11, 10, 4, 7, 1...
## $ ps_calc_11 <int> 1, 2, 4, 5, 4, 8, 2, 3, 5, 6, 2, 4, 3, 7, 6, 9,...
## $ ps_calc_12 <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 2, 4, 0, 3, 0,...
## $ ps_calc_13 <int> 1, 3, 2, 0, 0, 4, 4, 4, 4, 6, 1, 7, 0, 5, 5, 2,...
## $ ps_calc_14 <int> 12, 10, 4, 5, 4, 9, 6, 9, 6, 10, 8, 8, 12, 9, 6...
## $ ps_calc_15_bin <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ ps_calc_17_bin <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,...
## $ ps_calc_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_calc_19_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_20_bin <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The variables with missing values, along with their missing counts and percentages, are shown in the table below.
# Row binding train and test data
rbtt <- (train %>% select(-id,-target)) %>% bind_rows(test %>% select(-id))
# Missing values information
mvNos <- apply(rbtt, MARGIN = 2,
function(x){
sum(is.na(x))
})
mvInfo <- (data.frame(predictorName = colnames(rbtt), missCount = mvNos, missPercentage = 100*mvNos/nrow(rbtt)))
rownames(mvInfo) <- NULL
mvInfo <- mvInfo %>% filter(missCount > 0) %>% arrange(missCount)
mvInfo
There are 13 variables with missing values; ps_car_03_cat has the highest missing count and ps_car_12 the lowest.
The Spearman correlation plot of complete cases is shown below. Here the categorical variables are not converted to factors but used as integers. To avoid unnecessary clutter, the plot includes only variables whose average absolute pairwise correlation with the other variables is greater than 0.3.
# Variable correlation matrix
correlations <- cor(rbtt, use = "complete.obs", method = "spearman")
# Variables having significant correlations
sigCorr <- findCorrelation(correlations, cutoff = .3)
corrplot(cor(rbtt %>% select(sigCorr), use = "complete.obs", method = "spearman"), order = "alphabet")
It is clear from the table and correlation plot above that the variables with missing values (ps_car_11, ps_ind_04_cat, ps_car_01_cat, ps_ind_02_cat, ps_car_09_cat, ps_ind_05_cat, ps_car_07_cat and ps_car_05_cat) have correlations less than 0.3 in absolute value. Also, since the variables are anonymized, we do not know what they actually stand for. Therefore, instead of imputing missing values, we keep "-1" as it is in the data.
rm(rbtt)
# Read train and test files, keeping -1 as it is
train <- fread("train.csv")
test <- fread("test.csv")
# Class outcome for train
outcome <- factor(ifelse(train$target, "Claim", "NoClaim"), levels = c("Claim", "NoClaim"))
# Combining train and test files
comb <- train %>% select(-id,-target) %>% bind_rows(test %>% select(-id))
rm(train,test)
We convert variables that are categorical in nature to type factor. Some categorical variables have only two unique levels; we convert those to integers taking the value 0 or 1.
# Converting categorical variable to factor from integer
comb <- comb %>%
mutate_at(vars(ends_with("cat")), funs(factor))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use list() instead
##
## # Before:
## funs(name = f(.))
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
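As the warning suggests, funs() is deprecated; on dplyr 0.8 the list(~f(.)) form works, and on dplyr 1.0 or later the more idiomatic equivalent uses across(). A sketch of the across() form (an alternative, not run here):
# Equivalent factor conversion without the deprecated funs()
comb <- comb %>% mutate(across(ends_with("cat"), factor))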
# Converting data types of categorical variables that are binary in nature
catv <- apply(comb %>% select(ends_with("cat")),
MARGIN = 2,
function(x){
length(unique(x))
})
comb <- comb %>%
mutate_at(names(catv[catv==2]), funs(as.character)) %>%
mutate_at(names(catv[catv==2]), funs(as.integer))
rm(catv)
# Splitting data set into train and test dataframes
dtrain <- comb[1:length(outcome),]
dtest <- comb[-(1:length(outcome)),]
rm(comb)
nzv <- nearZeroVar(dtrain %>% mutate(target = outcome))
dtrain %>% mutate(target = outcome) %>% select(nzv) %>% colnames()
## [1] "ps_ind_05_cat" "ps_ind_10_bin" "ps_ind_11_bin" "ps_ind_12_bin"
## [5] "ps_ind_13_bin" "ps_ind_14" "ps_reg_03" "ps_car_10_cat"
## [9] "target"
There are 9 near-zero-variance variables, and target itself is among them, a consequence of the severe class imbalance. Since even the outcome is flagged, near-zero variance alone is not a reliable criterion for removal here, so we do not remove any of these variables.
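To see why these variables were flagged, nearZeroVar can also return its underlying diagnostics through the saveMetrics argument; a short sketch:
# Frequency ratio and percent-unique diagnostics for the flagged variables
nzv_metrics <- nearZeroVar(dtrain %>% mutate(target = outcome), saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")]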
Shown below are stacked bar plots and density plots depicting the distributions of the covariates by outcome level.
f1 <- dtrain %>%
ggplot(aes(ps_ind_01, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f2 <- dtrain %>%
ggplot(aes(ps_ind_02_cat, fill = outcome)) +
geom_bar() +
scale_y_log10()
f3 <- dtrain %>%
ggplot(aes(ps_ind_03, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f4 <- dtrain %>%
ggplot(aes(ps_ind_04_cat, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f5 <- dtrain %>%
ggplot(aes(ps_ind_05_cat, fill = outcome)) +
geom_bar() +
scale_y_log10()
f6 <- dtrain %>%
ggplot(aes(ps_ind_06_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f7 <- dtrain %>%
ggplot(aes(ps_ind_07_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f8 <- dtrain %>%
ggplot(aes(ps_ind_08_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f9 <- dtrain %>%
ggplot(aes(ps_ind_09_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f10 <- dtrain %>%
ggplot(aes(ps_ind_10_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f11 <- dtrain %>%
ggplot(aes(ps_ind_11_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f12 <- dtrain %>%
ggplot(aes(ps_ind_12_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f13 <- dtrain %>%
ggplot(aes(ps_ind_13_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f14 <- dtrain %>%
ggplot(aes(ps_ind_14, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f15 <- dtrain %>%
ggplot(aes(ps_ind_15, fill = outcome)) +
geom_bar() +
scale_y_log10()
f16 <- dtrain %>%
ggplot(aes(ps_ind_16_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f17 <- dtrain %>%
ggplot(aes(ps_ind_17_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f18 <- dtrain %>%
ggplot(aes(ps_ind_18_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
lay <- rbind(c(1,1,1,2,2),
c(3,3,3,4,4))
grid.arrange(f1, f2, f3, f4, layout_matrix = lay)
lay <- rbind(c(1,1,1,2,3),
c(4,5,6,7,8))
grid.arrange(f5, f6, f7, f8, f9, f10, f11, f12, layout_matrix = lay)
lay <- rbind(c(1,2,2,2,2),
c(3,3,4,5,6))
grid.arrange(f13, f15, f14, f16, f17, f18, layout_matrix = lay)
Apparently none of the levels of the ind variables clearly discerns either outcome level, except that all observations with level 4 of ps_ind_14 belong to NoClaim. But since the proportion of level 4 is relatively small, this finding carries little evidence.
f19 <- dtrain %>%
ggplot(aes(ps_reg_01, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
f20 <- dtrain %>%
ggplot(aes(ps_reg_02, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
f21 <- dtrain %>%
ggplot(aes(ps_reg_03, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
grid.arrange(f19, f20, f21, nrow = 3)
The distributions of the reg variables differ between the outcome levels; they may prove useful for discerning the outcome.
f22 <- dtrain %>%
ggplot(aes(ps_car_01_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f23 <- dtrain %>%
ggplot(aes(ps_car_02_cat, fill = outcome)) +
geom_bar() + scale_y_log10()
f24 <- dtrain %>%
ggplot(aes(ps_car_03_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f25 <- dtrain %>%
ggplot(aes(ps_car_04_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f26 <- dtrain %>%
ggplot(aes(ps_car_05_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f27 <- dtrain %>%
ggplot(aes(ps_car_06_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f28 <- dtrain %>%
ggplot(aes(ps_car_07_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f29 <- dtrain %>%
ggplot(aes(ps_car_08_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f30 <- dtrain %>%
ggplot(aes(ps_car_09_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f31 <- dtrain %>%
ggplot(aes(ps_car_10_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f32 <- dtrain %>%
ggplot(aes(ps_car_11_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f33 <- dtrain %>%
ggplot(aes(ps_car_11, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f34 <- dtrain %>%
ggplot(aes(ps_car_12, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f35 <- dtrain %>%
ggplot(aes(ps_car_13, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f36 <- dtrain %>%
ggplot(aes(ps_car_14, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f37 <- dtrain %>%
ggplot(aes(ps_car_15, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
lay <- rbind(c(1,1,1,2,2),
c(3,3,4,4,4))
grid.arrange(f22, f23, f24, f25, layout_matrix = lay)
lay <- rbind(c(1,1,2,2,3),
c(4,4,4,5,5))
grid.arrange(f26, f28, f29, f27, f30, layout_matrix = lay)
lay <- rbind(c(1,1,2,2,NA),
c(3,3,3,3,3))
grid.arrange(f31, f33, f32, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f34, f35, f36, f37, layout_matrix = lay)
We observe that:
The missing values of variables ps_car_02 and ps_car_11 belong to the outcome level NoClaim, but because of their low proportion the evidence is weak.
Though ps_car_11_cat is categorical, it has a whopping 104 levels.
The variables ps_car_13 and ps_car_15 show clearer differences in their distributions across the outcome levels than ps_car_12 and ps_car_14 do.
f38 <- dtrain %>%
ggplot(aes(ps_calc_01, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f39 <- dtrain %>%
ggplot(aes(ps_calc_02, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f40 <- dtrain %>%
ggplot(aes(ps_calc_03, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f41 <- dtrain %>%
ggplot(aes(ps_calc_04, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f42 <- dtrain %>%
ggplot(aes(ps_calc_05, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f43 <- dtrain %>%
ggplot(aes(ps_calc_06, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f44 <- dtrain %>%
ggplot(aes(ps_calc_07, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f45 <- dtrain %>%
ggplot(aes(ps_calc_08, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f46 <- dtrain %>%
ggplot(aes(ps_calc_09, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f47 <- dtrain %>%
ggplot(aes(ps_calc_10, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f48 <- dtrain %>%
ggplot(aes(ps_calc_11, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f49 <- dtrain %>%
ggplot(aes(ps_calc_12, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f50 <- dtrain %>%
ggplot(aes(ps_calc_13, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f51 <- dtrain %>%
ggplot(aes(ps_calc_14, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f52 <- dtrain %>%
ggplot(aes(ps_calc_15_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f53 <- dtrain %>%
ggplot(aes(ps_calc_16_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f54 <- dtrain %>%
ggplot(aes(ps_calc_17_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f55 <- dtrain %>%
ggplot(aes(ps_calc_18_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f56 <- dtrain %>%
ggplot(aes(ps_calc_19_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f57 <- dtrain %>%
ggplot(aes(ps_calc_20_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
lay <- rbind(c(1,1,2,2,3,3),
c(4,4,4,5,5,5))
grid.arrange(f38, f39, f40, f41, f42, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f43, f44, f45, f46, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f47, f48, f49, f50, layout_matrix = lay)
lay <- rbind(c(1,1,1,1,2),
c(3,4,5,6,7))
grid.arrange(f51, f52, f53, f54, f55, f56, f57, layout_matrix = lay)
We observe that almost all levels of the calc variables have the same proportion of outcome levels. Hence it seems the calc variables carry more noise than signal.
We now separate our covariates into categorical and numerical types.
# Numerical variables
numVars <- dtrain %>% select(-ends_with("bin"),-ends_with("cat")) %>% colnames()
# Categorical variables
catVars <- dtrain %>% select(ends_with("bin"),ends_with("cat")) %>% colnames()
A two-tailed Wilcoxon rank sum test, followed by a Benjamini-Hochberg p-value correction, is applied to the numerical covariates. Here the null hypothesis is that the distributions of a covariate for the two outcome levels differ by a location shift of 0.
# Apply Wilcoxon rank sum test on each numerical type covariate
wrst <- apply(dtrain %>% select(numVars), 2,
function(x, y)
{
wxn <- wilcox.test(x ~ y, conf.int = T)[c("statistic", "p.value")]
unlist(wxn)
},
y = ifelse(outcome == "Claim", 1,0))
wrst <- as.data.frame(t(wrst))
names(wrst) <- c("W.Statistic", "W.test_p.value")
wrst$p.adj <- p.adjust(wrst$W.test_p.value , method = "fdr")
wrst$Predictor <- rownames(wrst)
# Scatter plot: each point is a variable, with coordinates given by the
# transformed adjusted p-value and the W statistic
wrst_plot <- wrst %>% mutate(Significance=ifelse(wrst$p.adj<0.01, "p.adj<0.01", "Not Significant"),
W.Statistic_norm = W.Statistic/max(W.Statistic)) %>%
ggplot(aes(x = W.Statistic_norm, y = -log10(p.adj), color = Significance)) +
geom_point(aes(text = (paste("Predictor:", Predictor, "<br>p.adj:", p.adj))), alpha = 0.4, size = 4) +
labs(x = "Normalized W Statistic", y = "-log10(p.adj)")
ggplotly(wrst_plot, tooltip = c("text"))
From the above plot we see that, in this category, ps_reg_03 has high potential to differentiate between the outcome levels, which is also supported by its density plot. All the calc variables in this category show no signal whatsoever.
A one-tailed Pearson's Chi-squared test, followed by a Benjamini-Hochberg p-value correction, is applied to the categorical covariates. Here the null hypothesis is that the frequency distribution of a covariate is the same for both levels of the outcome.
# Function to compute contingency table and Chi squared test statistic with its p-value
tableCalcs <- function(x, y)
{
tab <- table(x, y)
cst <- chisq.test(tab)
out <- c(statistic = cst$statistic,
P = cst$p.value)
}
# Apply Pearson's Chi-squared test on each categorical covariate
chisqt <- apply(dtrain %>% select(catVars), 2, tableCalcs, y = outcome)
chisqt <- as.data.frame(t(chisqt))
names(chisqt) <- c("Chisq.Statistic", "Chisq.test_p.value")
chisqt$p.adj <- p.adjust(chisqt$Chisq.test_p.value , method = "fdr")
chisqt$Predictor <- rownames(chisqt)
# Scatter plot: each point is a variable, with coordinates given by the
# transformed adjusted p-value and the Chi-squared statistic
chisqt_plot <- chisqt %>% mutate(Significance=ifelse(chisqt$p.adj<0.01, "p.adj<0.01", "Not Significant")) %>%
ggplot(aes(x = log(Chisq.Statistic), y = -log10(p.adj), color = Significance)) +
geom_point(aes(text = (paste("Predictor:", Predictor, "<br>p.adj:", p.adj))), alpha = 0.4, size = 4) +
labs(x = "log(Chisq.Statistic)", y = "-log10(p.adj)")
ggplotly(chisqt_plot, tooltip = c("text"))
From the above plot we see that in this category ps_car_11_cat is ranked relatively high compared to the others. Again, none of the calc variables are deemed important by this test. There also appears to be a near-deterministic relation between the transformed adjusted p-value and the Chi-squared statistic, which is unsurprising since the p-value is computed from the statistic.
For both types of covariates we select those whose adjusted p-value is less than 0.01. The categorical covariates are then decomposed into binary indicators by dummification (one-hot encoding).
# Selecting predictors passing the tests
sig_.01_f <- c(wrst %>% filter(p.adj < 0.01) %>% .$Predictor,
chisqt %>% filter(p.adj < 0.01) %>% .$Predictor)
dtrain <- dtrain %>% select(sig_.01_f)
dtest <- dtest %>% select(sig_.01_f)
# One hot encoding categorical covariates
dmy <- dummyVars(" ~ .", data = dtrain)
dtrain <- data.frame(predict(dmy, newdata = dtrain))
dtest <- data.frame(predict(dmy, newdata = dtest))
The dimensions of the train and test data after the above process are:
dim(dtrain)
## [1] 595212 199
dim(dtest)
## [1] 892816 199
An Extreme Gradient Boosting (XGBoost) tree-based method is used for building the models. The competition evaluation metric was the Normalized Gini Coefficient, that is \(2 \times \mathrm{AUC} - 1\), where AUC is the area under the receiver operating characteristic curve. For tuning the models we use AUC as the performance metric together with a 6-fold cross-validation scheme. Two models with different hyperparameter values are built, and their predictions are ensembled by taking their harmonic mean.
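For reference, the metric can be computed directly from predicted probabilities; a minimal sketch, assuming the pROC package is available (normalizedGini is a hypothetical helper, not used elsewhere in this analysis):
# Normalized Gini coefficient = 2 * AUC - 1
normalizedGini <- function(actual, predicted) {
  auc_value <- as.numeric(pROC::auc(actual, predicted, quiet = TRUE))
  2 * auc_value - 1
}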
set.seed(8765)
ps_xgbt_6f_8765 <- train(x = dtrain, y = outcome,
method = "xgbTree",
metric = "ROC",
tuneGrid = expand.grid(nrounds = seq(50,3000,50),
max_depth = 4,
eta = 0.01,
gamma = 0.3,
colsample_bytree = 0.9,
min_child_weight = 0,
subsample = 0.65),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = twoClassSummary,
search = "grid"),
alpha = 10)
ps_xgbt_6f_8765
## eXtreme Gradient Boosting
##
## 595212 samples
## 199 predictor
## 2 classes: 'Claim', 'NoClaim'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
## Summary of sample sizes: 496011, 496010, 496009, 496010, 496009, 496011, ...
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 50 0.6014241 0.000000e+00 1.0000000
## 100 0.6073256 0.000000e+00 1.0000000
## 150 0.6115685 0.000000e+00 1.0000000
## 200 0.6150729 0.000000e+00 1.0000000
## 250 0.6189560 0.000000e+00 1.0000000
## 300 0.6228368 0.000000e+00 1.0000000
## 350 0.6264750 0.000000e+00 1.0000000
## 400 0.6292653 0.000000e+00 1.0000000
## 450 0.6316193 0.000000e+00 1.0000000
## 500 0.6333705 0.000000e+00 1.0000000
## 550 0.6348569 0.000000e+00 1.0000000
## 600 0.6359547 0.000000e+00 1.0000000
## 650 0.6370333 0.000000e+00 1.0000000
## 700 0.6377656 0.000000e+00 1.0000000
## 750 0.6384864 0.000000e+00 1.0000000
## 800 0.6391282 0.000000e+00 1.0000000
## 850 0.6396945 0.000000e+00 1.0000000
## 900 0.6401506 0.000000e+00 1.0000000
## 950 0.6405374 0.000000e+00 1.0000000
## 1000 0.6409591 0.000000e+00 1.0000000
## 1050 0.6413068 0.000000e+00 1.0000000
## 1100 0.6416114 0.000000e+00 1.0000000
## 1150 0.6418509 0.000000e+00 1.0000000
## 1200 0.6420609 0.000000e+00 1.0000000
## 1250 0.6422741 0.000000e+00 1.0000000
## 1300 0.6424910 0.000000e+00 0.9999983
## 1350 0.6426626 4.610420e-05 0.9999983
## 1400 0.6428099 4.610420e-05 0.9999983
## 1450 0.6429623 9.220839e-05 0.9999983
## 1500 0.6431275 9.220839e-05 0.9999983
## 1550 0.6432479 9.220839e-05 0.9999983
## 1600 0.6433807 9.220839e-05 0.9999983
## 1650 0.6434876 9.220839e-05 0.9999983
## 1700 0.6435769 9.220839e-05 0.9999983
## 1750 0.6436743 9.220839e-05 0.9999983
## 1800 0.6437844 9.220839e-05 0.9999983
## 1850 0.6438632 9.220839e-05 0.9999983
## 1900 0.6439749 9.220839e-05 0.9999983
## 1950 0.6440682 9.220839e-05 0.9999983
## 2000 0.6441336 9.220839e-05 0.9999983
## 2050 0.6441769 9.220839e-05 0.9999983
## 2100 0.6442097 9.220839e-05 0.9999983
## 2150 0.6442460 9.220839e-05 0.9999983
## 2200 0.6442260 9.220839e-05 0.9999983
## 2250 0.6442576 9.220839e-05 0.9999983
## 2300 0.6442793 9.220839e-05 0.9999983
## 2350 0.6443322 9.220839e-05 0.9999983
## 2400 0.6443650 9.220839e-05 0.9999983
## 2450 0.6443842 9.220839e-05 0.9999983
## 2500 0.6443695 9.220839e-05 0.9999983
## 2550 0.6443812 9.220839e-05 0.9999983
## 2600 0.6443704 9.220839e-05 0.9999983
## 2650 0.6444024 9.220839e-05 0.9999983
## 2700 0.6444027 9.220839e-05 0.9999983
## 2750 0.6444069 9.220839e-05 0.9999965
## 2800 0.6444012 9.220839e-05 0.9999965
## 2850 0.6443677 9.220839e-05 0.9999965
## 2900 0.6443400 9.220839e-05 0.9999965
## 2950 0.6443642 9.220839e-05 0.9999965
## 3000 0.6443598 9.220839e-05 0.9999965
##
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0.3
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.9
## Tuning parameter 'min_child_weight' was held constant at a value of 0
## Tuning parameter 'subsample' was held constant at a value of 0.65
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 2750, max_depth =
## 4, eta = 0.01, gamma = 0.3, colsample_bytree = 0.9, min_child_weight =
## 0 and subsample = 0.65.
The cross-validated Normalized Gini Coefficient for the above model is 0.2888138.
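This value follows directly from the best cross-validated AUC in the table above:
# Normalized Gini from the model's best cross-validated AUC (nrounds = 2750)
2 * max(ps_xgbt_6f_8765$results$ROC) - 1 # 0.2888138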
set.seed(9876)
ps_xgbt_6f_9876 <- train(x = dtrain, y = outcome,
method = "xgbTree",
metric = "ROC",
tuneGrid = expand.grid(nrounds = seq(50,4000,50),
max_depth = 4,
eta = 0.01,
gamma = 0,
colsample_bytree = 0.9,
min_child_weight = 3,
subsample = 0.5),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = twoClassSummary,
search = "grid"),
alpha = 10)
ps_xgbt_6f_9876
## eXtreme Gradient Boosting
##
## 595212 samples
## 199 predictor
## 2 classes: 'Claim', 'NoClaim'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
## Summary of sample sizes: 496010, 496011, 496009, 496011, 496009, 496010, ...
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 50 0.5992751 0.000000e+00 1.0000000
## 100 0.6067106 0.000000e+00 1.0000000
## 150 0.6114146 0.000000e+00 1.0000000
## 200 0.6149008 0.000000e+00 1.0000000
## 250 0.6188799 0.000000e+00 1.0000000
## 300 0.6227925 0.000000e+00 1.0000000
## 350 0.6260832 0.000000e+00 1.0000000
## 400 0.6290621 0.000000e+00 1.0000000
## 450 0.6314908 0.000000e+00 1.0000000
## 500 0.6332994 0.000000e+00 1.0000000
## 550 0.6347740 0.000000e+00 1.0000000
## 600 0.6360976 0.000000e+00 1.0000000
## 650 0.6370692 0.000000e+00 1.0000000
## 700 0.6379984 0.000000e+00 1.0000000
## 750 0.6387514 0.000000e+00 1.0000000
## 800 0.6392878 0.000000e+00 1.0000000
## 850 0.6398697 0.000000e+00 1.0000000
## 900 0.6403674 0.000000e+00 1.0000000
## 950 0.6408321 0.000000e+00 1.0000000
## 1000 0.6411665 0.000000e+00 1.0000000
## 1050 0.6414965 0.000000e+00 1.0000000
## 1100 0.6418160 0.000000e+00 1.0000000
## 1150 0.6421040 0.000000e+00 1.0000000
## 1200 0.6423593 0.000000e+00 1.0000000
## 1250 0.6425700 0.000000e+00 1.0000000
## 1300 0.6427882 0.000000e+00 1.0000000
## 1350 0.6429183 0.000000e+00 1.0000000
## 1400 0.6430525 0.000000e+00 1.0000000
## 1450 0.6432132 0.000000e+00 1.0000000
## 1500 0.6433660 0.000000e+00 1.0000000
## 1550 0.6435067 0.000000e+00 1.0000000
## 1600 0.6436087 0.000000e+00 1.0000000
## 1650 0.6436954 0.000000e+00 1.0000000
## 1700 0.6437715 0.000000e+00 1.0000000
## 1750 0.6438921 0.000000e+00 1.0000000
## 1800 0.6439633 0.000000e+00 1.0000000
## 1850 0.6440385 0.000000e+00 1.0000000
## 1900 0.6441198 0.000000e+00 1.0000000
## 1950 0.6441692 0.000000e+00 1.0000000
## 2000 0.6442315 0.000000e+00 1.0000000
## 2050 0.6442940 0.000000e+00 1.0000000
## 2100 0.6443854 0.000000e+00 1.0000000
## 2150 0.6444275 0.000000e+00 1.0000000
## 2200 0.6444394 0.000000e+00 1.0000000
## 2250 0.6444586 0.000000e+00 1.0000000
## 2300 0.6444788 0.000000e+00 1.0000000
## 2350 0.6445244 0.000000e+00 1.0000000
## 2400 0.6445915 0.000000e+00 1.0000000
## 2450 0.6446100 0.000000e+00 1.0000000
## 2500 0.6446240 0.000000e+00 1.0000000
## 2550 0.6446438 0.000000e+00 1.0000000
## 2600 0.6446664 0.000000e+00 0.9999983
## 2650 0.6446694 0.000000e+00 0.9999983
## 2700 0.6446783 0.000000e+00 0.9999983
## 2750 0.6447230 0.000000e+00 0.9999983
## 2800 0.6447030 9.219564e-05 0.9999983
## 2850 0.6447186 9.219564e-05 0.9999983
## 2900 0.6447465 9.219564e-05 0.9999983
## 2950 0.6447184 9.219564e-05 0.9999983
## 3000 0.6447291 9.219564e-05 0.9999983
## 3050 0.6447476 9.219564e-05 0.9999983
## 3100 0.6447402 1.382871e-04 0.9999983
## 3150 0.6447525 9.219564e-05 0.9999983
## 3200 0.6447079 1.843785e-04 0.9999983
## 3250 0.6446704 1.843785e-04 0.9999983
## 3300 0.6446563 1.843785e-04 0.9999983
## 3350 0.6446641 1.382871e-04 0.9999983
## 3400 0.6446431 9.219564e-05 0.9999983
## 3450 0.6446492 9.219564e-05 0.9999983
## 3500 0.6446198 1.382871e-04 0.9999983
## 3550 0.6446332 1.382871e-04 0.9999983
## 3600 0.6446014 1.843785e-04 0.9999983
## 3650 0.6445874 1.382871e-04 0.9999983
## 3700 0.6445727 1.843785e-04 0.9999983
## 3750 0.6445566 1.382871e-04 0.9999983
## 3800 0.6445549 1.382871e-04 0.9999983
## 3850 0.6445344 1.382871e-04 0.9999983
## 3900 0.6445038 1.382871e-04 0.9999983
## 3950 0.6444976 1.843785e-04 0.9999983
## 4000 0.6444753 1.843785e-04 0.9999983
##
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.9
## Tuning parameter 'min_child_weight' was held constant at a value of 3
## Tuning parameter 'subsample' was held constant at a value of 0.5
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 3150, max_depth =
## 4, eta = 0.01, gamma = 0, colsample_bytree = 0.9, min_child_weight =
## 3 and subsample = 0.5.
The cross-validated Normalized Gini Coefficient for the above model is 0.289505.
We now take the harmonic mean of the predictions of the above two models and write the submission files. Since the harmonic mean \(2/(1/p_1 + 1/p_2)\) of two probabilities is pulled toward the smaller one, the ensemble is slightly conservative; we get a small improvement in the performance metric using it.
# Model #1 predictions
p_8765 <- predict(ps_xgbt_6f_8765, dtest, type = "prob")[,"Claim"]
sub_ps_xgbt_6f_8765 <- fread("sample_submission.csv") %>% mutate(target = p_8765)
write_excel_csv(sub_ps_xgbt_6f_8765, "sub_ps_xgbt_6f_8765.csv")
# Model #2 predictions
p_9876 <- predict(ps_xgbt_6f_9876, dtest, type = "prob")[,"Claim"]
sub_ps_xgbt_6f_9876 <- fread("sample_submission.csv") %>% mutate(target = p_9876)
write_excel_csv(sub_ps_xgbt_6f_9876, "sub_ps_xgbt_6f_9876.csv")
# Harmonic mean of predictions
hm2 <- 1/((1/p_8765 + 1/p_9876)/2)
sub_ps_xgbt_hm2 <- fread("sample_submission.csv") %>% mutate(target = hm2)
write_excel_csv(sub_ps_xgbt_hm2, "sub_ps_xgbt_hm2.csv")
A snapshot of the private leaderboard scores is shown below.
The calc variables were entirely eliminated by the selection process, which suggests that they would only contribute noise if added to the feature set. Since the dataset was imbalanced, AUC was used as the performance metric for tuning the hyperparameters of the xgboost models. It is apparent from the private leaderboard scores that the given dataset was hard to learn from. The solution proposed in this document improves on the \(35^{th}\)-place solution on the private leaderboard, placing it among the top 0.7% of solutions.