The aim of this challenge is to predict the probability that a driver will initiate an auto insurance claim in the next year. These predictions will allow the insurer to further tailor its prices, and hopefully make auto insurance coverage more accessible to more drivers. This competition was one of the most popular on Kaggle in the featured category.
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
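For illustration, the group and type encoded in a feature name can be recovered by simple pattern matching; a small base-R sketch over a few example names (a hypothetical helper, not part of the analysis below):
# Parse group (ind, reg, car, calc) and type from feature names
feats <- c("ps_ind_06_bin", "ps_car_01_cat", "ps_reg_03", "ps_calc_10")
data.frame(feature = feats,
           group = sub("^ps_([a-z]+)_.*$", "\\1", feats),
           type = ifelse(grepl("_bin$", feats), "binary",
                         ifelse(grepl("_cat$", feats), "categorical",
                                "continuous/ordinal")))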
# Required libraries
library(tidyverse)
library(data.table)
library(caret)
library(gridExtra)
library(plotly)
library(corrplot)
library(DT)
# Reading data files, replacing -1 with NA
train <- fread("train.csv", na = c("-1","-1.0"))
test <- fread("test.csv", na = c("-1","-1.0"))
It is clear from the variable target that the distribution of class labels is highly imbalanced, with the positive class amounting to only 3.645 percent of observations.
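A quick tabulation of target (using the train table read above) confirms this imbalance:
# Class distribution of the target variable
train %>% count(target) %>% mutate(percentage = 100 * n / sum(n))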
summary(train)
## id target ps_ind_01 ps_ind_02_cat
## Min. : 7 Min. :0.00000 Min. :0.0 Min. :1.00
## 1st Qu.: 371992 1st Qu.:0.00000 1st Qu.:0.0 1st Qu.:1.00
## Median : 743548 Median :0.00000 Median :1.0 Median :1.00
## Mean : 743804 Mean :0.03645 Mean :1.9 Mean :1.36
## 3rd Qu.:1115549 3rd Qu.:0.00000 3rd Qu.:3.0 3rd Qu.:2.00
## Max. :1488027 Max. :1.00000 Max. :7.0 Max. :4.00
## NA's :216
## ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin
## Min. : 0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median : 4.000 Median :0.000 Median :0.000 Median :0.0000
## Mean : 4.423 Mean :0.417 Mean :0.419 Mean :0.3937
## 3rd Qu.: 6.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :11.000 Max. :1.000 Max. :6.000 Max. :1.0000
## NA's :83 NA's :5809
## ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.000000
## Mean :0.257 Mean :0.1639 Mean :0.1853 Mean :0.000373
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000000
##
## ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.001692 Mean :0.009439 Mean :0.0009476
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
##
## ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin
## Min. :0.00000 Min. : 0.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 5.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median : 7.0 Median :1.0000 Median :0.0000
## Mean :0.01245 Mean : 7.3 Mean :0.6608 Mean :0.1211
## 3rd Qu.:0.00000 3rd Qu.:10.0 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4.00000 Max. :13.0 Max. :1.0000 Max. :1.0000
##
## ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.06
## 1st Qu.:0.0000 1st Qu.:0.400 1st Qu.:0.2000 1st Qu.:0.63
## Median :0.0000 Median :0.700 Median :0.3000 Median :0.80
## Mean :0.1534 Mean :0.611 Mean :0.4392 Mean :0.89
## 3rd Qu.:0.0000 3rd Qu.:0.900 3rd Qu.:0.6000 3rd Qu.:1.08
## Max. :1.0000 Max. :0.900 Max. :1.8000 Max. :4.04
## NA's :107772
## ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat
## Min. : 0.000 Min. :0.0000 Min. :0.0 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:1.0000 1st Qu.:0.0 1st Qu.:0.0000
## Median : 7.000 Median :1.0000 Median :1.0 Median :0.0000
## Mean : 8.298 Mean :0.8299 Mean :0.6 Mean :0.7252
## 3rd Qu.:11.000 3rd Qu.:1.0000 3rd Qu.:1.0 3rd Qu.:0.0000
## Max. :11.000 Max. :1.0000 Max. :1.0 Max. :9.0000
## NA's :107 NA's :5 NA's :411231
## ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat
## Min. :0.00 Min. : 0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.0000
## Median :1.00 Median : 7.000 Median :1.000 Median :1.0000
## Mean :0.53 Mean : 6.555 Mean :0.948 Mean :0.8321
## 3rd Qu.:1.00 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.00 Max. :17.000 Max. :1.000 Max. :1.0000
## NA's :266551 NA's :11489
## ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11
## Min. :0.000 Min. :0.0000 Min. : 1.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.: 32.00 1st Qu.:2.000
## Median :2.000 Median :1.0000 Median : 65.00 Median :3.000
## Mean :1.331 Mean :0.9921 Mean : 62.22 Mean :2.346
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 93.00 3rd Qu.:3.000
## Max. :4.000 Max. :2.0000 Max. :104.00 Max. :3.000
## NA's :569 NA's :5
## ps_car_12 ps_car_13 ps_car_14 ps_car_15
## Min. :0.1000 Min. :0.2506 Min. :0.11 Min. :0.000
## 1st Qu.:0.3162 1st Qu.:0.6709 1st Qu.:0.35 1st Qu.:2.828
## Median :0.3742 Median :0.7658 Median :0.37 Median :3.317
## Mean :0.3799 Mean :0.8133 Mean :0.37 Mean :3.066
## 3rd Qu.:0.4000 3rd Qu.:0.9062 3rd Qu.:0.40 3rd Qu.:3.606
## Max. :1.2649 Max. :3.7206 Max. :0.64 Max. :3.742
## NA's :1 NA's :42620
## ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:2.000
## Median :0.5000 Median :0.4000 Median :0.5000 Median :2.000
## Mean :0.4498 Mean :0.4496 Mean :0.4498 Mean :2.372
## 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:3.000
## Max. :0.9000 Max. :0.9000 Max. :0.9000 Max. :5.000
##
## ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08
## Min. :0.000 Min. : 0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:1.000 1st Qu.: 7.000 1st Qu.:2.000 1st Qu.: 8.000
## Median :2.000 Median : 8.000 Median :3.000 Median : 9.000
## Mean :1.886 Mean : 7.689 Mean :3.006 Mean : 9.226
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :6.000 Max. :10.000 Max. :9.000 Max. :12.000
##
## ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.: 4.000 1st Qu.: 1.000
## Median :2.000 Median : 8.000 Median : 5.000 Median : 1.000
## Mean :2.339 Mean : 8.434 Mean : 5.441 Mean : 1.442
## 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 2.000
## Max. :7.000 Max. :25.000 Max. :19.000 Max. :10.000
##
## ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: 6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median : 7.000 Median :0.0000 Median :1.0000
## Mean : 2.872 Mean : 7.539 Mean :0.1224 Mean :0.6278
## 3rd Qu.: 4.000 3rd Qu.: 9.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :13.000 Max. :23.000 Max. :1.0000 Max. :1.0000
##
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.5542 Mean :0.2872 Mean :0.349 Mean :0.1533
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
##
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01 <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03 <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01 <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02 <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03 <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11 <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12 <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14 <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15 <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01 <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02 <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03 <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04 <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05 <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06 <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07 <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08 <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09 <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10 <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11 <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12 <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13 <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14 <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
The summary of test data is similar to that of train.
summary(test)
## id ps_ind_01 ps_ind_02_cat ps_ind_03
## Min. : 0 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 372022 1st Qu.:0.000 1st Qu.:1.000 1st Qu.: 2.000
## Median : 744307 Median :1.000 Median :1.000 Median : 4.000
## Mean : 744154 Mean :1.902 Mean :1.359 Mean : 4.414
## 3rd Qu.:1116309 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 6.000
## Max. :1488026 Max. :7.000 Max. :4.000 Max. :11.000
## NA's :307
## ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.4176 Mean :0.422 Mean :0.3932 Mean :0.2572
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.000 Max. :1.0000 Max. :1.0000
## NA's :145 NA's :8710
## ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.0000 Median :0.0000 Median :0.000000 Median :0.000000
## Mean :0.1637 Mean :0.1859 Mean :0.000373 Mean :0.001595
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.000000
##
## ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. : 0.000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.: 5.000
## Median :0.000000 Median :0.000000 Median :0.00000 Median : 7.000
## Mean :0.009376 Mean :0.001039 Mean :0.01238 Mean : 7.297
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:10.000
## Max. :1.000000 Max. :1.000000 Max. :4.00000 Max. :13.000
##
## ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.4000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.7000
## Mean :0.6606 Mean :0.1204 Mean :0.155 Mean :0.6111
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.9000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :0.9000
##
## ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat
## Min. :0.0000 Min. :0.06 Min. : 0.000 Min. :0.00
## 1st Qu.:0.2000 1st Qu.:0.63 1st Qu.: 7.000 1st Qu.:1.00
## Median :0.3000 Median :0.80 Median : 7.000 Median :1.00
## Mean :0.4399 Mean :0.89 Mean : 8.294 Mean :0.83
## 3rd Qu.:0.6000 3rd Qu.:1.09 3rd Qu.:11.000 3rd Qu.:1.00
## Max. :1.8000 Max. :4.42 Max. :11.000 Max. :1.00
## NA's :161684 NA's :160 NA's :5
## ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat
## Min. :0.0 Min. :0.0000 Min. :0.0 Min. : 0.000
## 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0 1st Qu.: 1.000
## Median :1.0 Median :0.0000 Median :1.0 Median : 7.000
## Mean :0.6 Mean :0.7258 Mean :0.5 Mean : 6.564
## 3rd Qu.:1.0 3rd Qu.:0.0000 3rd Qu.:1.0 3rd Qu.:11.000
## Max. :1.0 Max. :9.0000 Max. :1.0 Max. :17.000
## NA's :616911 NA's :400359
## ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.00 1st Qu.:1.0000
## Median :1.000 Median :1.0000 Median :2.00 Median :1.0000
## Mean :0.948 Mean :0.8323 Mean :1.33 Mean :0.9921
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :4.00 Max. :2.0000
## NA's :17331 NA's :877
## ps_car_11_cat ps_car_11 ps_car_12 ps_car_13
## Min. : 1.00 Min. :0.000 Min. :0.1414 Min. :0.2758
## 1st Qu.: 32.00 1st Qu.:2.000 1st Qu.:0.3162 1st Qu.:0.6712
## Median : 65.00 Median :3.000 Median :0.3742 Median :0.7661
## Mean : 62.28 Mean :2.347 Mean :0.3800 Mean :0.8136
## 3rd Qu.: 94.00 3rd Qu.:3.000 3rd Qu.:0.4000 3rd Qu.:0.9061
## Max. :104.00 Max. :3.000 Max. :1.2649 Max. :4.0313
## NA's :1
## ps_car_14 ps_car_15 ps_calc_01 ps_calc_02
## Min. :0.11 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.35 1st Qu.:2.828 1st Qu.:0.2000 1st Qu.:0.2000
## Median :0.37 Median :3.317 Median :0.4000 Median :0.5000
## Mean :0.37 Mean :3.068 Mean :0.4496 Mean :0.4505
## 3rd Qu.:0.40 3rd Qu.:3.606 3rd Qu.:0.7000 3rd Qu.:0.7000
## Max. :0.64 Max. :3.742 Max. :0.9000 Max. :0.9000
## NA's :63805
## ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 1.000
## 1st Qu.:0.2000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 7.000
## Median :0.4000 Median :2.000 Median :2.000 Median : 8.000
## Mean :0.4501 Mean :2.371 Mean :1.885 Mean : 7.688
## 3rd Qu.:0.7000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :0.9000 Max. :5.000 Max. :6.000 Max. :10.000
##
## ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10
## Min. :0.00 Min. : 1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:2.00 1st Qu.: 8.000 1st Qu.:1.000 1st Qu.: 6.000
## Median :3.00 Median : 9.000 Median :2.000 Median : 8.000
## Mean :3.01 Mean : 9.226 Mean :2.339 Mean : 8.443
## 3rd Qu.:4.00 3rd Qu.:10.000 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :9.00 Max. :12.000 Max. :7.000 Max. :25.000
##
## ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median : 5.000 Median : 1.00 Median : 3.000 Median : 7.00
## Mean : 5.438 Mean : 1.44 Mean : 2.875 Mean : 7.54
## 3rd Qu.: 7.000 3rd Qu.: 2.00 3rd Qu.: 4.000 3rd Qu.: 9.00
## Max. :20.000 Max. :11.00 Max. :15.000 Max. :28.00
##
## ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.1237 Mean :0.6278 Mean :0.5547 Mean :0.2878
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3493 Mean :0.1524
## 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
The test data has around 1.5 times as many rows as the train data.
glimpse(test)
## Observations: 892,816
## Variables: 58
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 18,...
## $ ps_ind_01 <int> 0, 4, 5, 0, 5, 0, 0, 0, 0, 1, 0, 1, 1, 3, 0, 2,...
## $ ps_ind_02_cat <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,...
## $ ps_ind_03 <int> 8, 5, 3, 6, 7, 6, 3, 0, 7, 6, 5, 4, 2, 3, 1, 2,...
## $ ps_ind_04_cat <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 6, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ps_ind_08_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ ps_ind_09_bin <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 12, 5, 10, 4, 4, 10, 11, 7, 6, 7, 3, 9, 8, 0, 8...
## $ ps_ind_16_bin <int> 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ ps_ind_17_bin <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_reg_01 <dbl> 0.5, 0.9, 0.4, 0.1, 0.9, 0.9, 0.1, 0.9, 0.4, 0....
## $ ps_reg_02 <dbl> 0.3, 0.5, 0.0, 0.2, 0.4, 0.5, 0.1, 1.1, 0.0, 1....
## $ ps_reg_03 <dbl> 0.6103278, 0.7713624, 0.9161741, NA, 0.8177714,...
## $ ps_car_01_cat <int> 7, 4, 11, 7, 11, 9, 6, 7, 11, 11, 11, 11, 11, 1...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2,...
## $ ps_car_05_cat <int> NA, 0, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 1, 11, 14, 1, 11, 11, 1, 11, 2, 4, 11, 7, 6, 1,...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1...
## $ ps_car_08_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,...
## $ ps_car_09_cat <int> 2, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 2,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 65, 103, 29, 40, 101, 11, 10, 103, 104, 104, 10...
## $ ps_car_11 <int> 1, 1, 3, 2, 3, 2, 2, 3, 2, 2, 3, 3, 2, 1, 2, 0,...
## $ ps_car_12 <dbl> 0.3162278, 0.3162278, 0.4000000, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.6695564, 0.6063200, 0.8962387, 0.6521104, 0.8...
## $ ps_car_14 <dbl> 0.3521363, 0.3583295, 0.3984972, 0.3814446, 0.3...
## $ ps_car_15 <dbl> 3.464102, 2.828427, 3.316625, 2.449490, 3.31662...
## $ ps_calc_01 <dbl> 0.1, 0.4, 0.6, 0.1, 0.9, 0.7, 0.9, 0.8, 0.9, 0....
## $ ps_calc_02 <dbl> 0.8, 0.5, 0.6, 0.5, 0.6, 0.9, 0.8, 0.9, 0.3, 0....
## $ ps_calc_03 <dbl> 0.6, 0.4, 0.6, 0.5, 0.8, 0.4, 0.8, 0.5, 0.0, 0....
## $ ps_calc_04 <int> 1, 3, 2, 2, 3, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 1,...
## $ ps_calc_05 <int> 1, 3, 3, 1, 4, 1, 1, 2, 2, 1, 2, 2, 1, 2, 3, 4,...
## $ ps_calc_06 <int> 6, 8, 7, 7, 7, 9, 7, 8, 9, 7, 5, 9, 8, 7, 7, 8,...
## $ ps_calc_07 <int> 3, 4, 4, 3, 1, 5, 3, 4, 7, 1, 5, 3, 0, 1, 3, 3,...
## $ ps_calc_08 <int> 6, 10, 6, 12, 10, 9, 9, 11, 9, 9, 7, 10, 9, 9, ...
## $ ps_calc_09 <int> 2, 2, 3, 1, 4, 4, 5, 2, 0, 1, 2, 1, 4, 1, 4, 3,...
## $ ps_calc_10 <int> 9, 7, 12, 13, 12, 12, 6, 8, 10, 11, 10, 4, 7, 1...
## $ ps_calc_11 <int> 1, 2, 4, 5, 4, 8, 2, 3, 5, 6, 2, 4, 3, 7, 6, 9,...
## $ ps_calc_12 <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 2, 4, 0, 3, 0,...
## $ ps_calc_13 <int> 1, 3, 2, 0, 0, 4, 4, 4, 4, 6, 1, 7, 0, 5, 5, 2,...
## $ ps_calc_14 <int> 12, 10, 4, 5, 4, 9, 6, 9, 6, 10, 8, 8, 12, 9, 6...
## $ ps_calc_15_bin <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ ps_calc_17_bin <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,...
## $ ps_calc_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_calc_19_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_20_bin <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The variables with missing values, along with their missing counts and percentages, are shown in the table below.
# Row binding train and test data
rbtt <- (train %>% select(-id,-target)) %>% bind_rows(test %>% select(-id))
# Missing values information
mvNos <- apply(rbtt, MARGIN = 2,
function(x){
sum(is.na(x))
})
mvInfo <- (data.frame(predictorName = colnames(rbtt), missCount = mvNos, missPercentage = 100*mvNos/nrow(rbtt)))
rownames(mvInfo) <- NULL
mvInfo <- mvInfo %>% filter(missCount > 0) %>% arrange(missCount)
mvInfo
There are 13 variables with missing values; ps_car_03_cat has the highest missing count and ps_car_12 the lowest.
The Spearman correlation plot of complete cases is shown below. Here the categorical variables are not converted to factors but used as integers. To avoid unnecessary clutter, the plot includes only variables whose average absolute pairwise correlation with the other variables is greater than 0.3.
# Variable correlation matrix
correlations <- cor(rbtt, use = "complete.obs", method = "spearman")
# Variables having significant correlations
sigCorr <- findCorrelation(correlations, cutoff = .3)
corrplot(cor(rbtt %>% select(sigCorr), use = "complete.obs", method = "spearman"), order = "alphabet")
It is clear from the table and correlation plot above that the variables with missing values (ps_car_11, ps_ind_04_cat, ps_car_01_cat, ps_ind_02_cat, ps_car_09_cat, ps_ind_05_cat, ps_car_07_cat and ps_car_05_cat) have correlations less than 0.3 in absolute value. Also, since the variables are anonymized, we do not know what they actually stand for. Therefore, instead of imputing missing values, we keep "-1" as it is in the data.
rm(rbtt)
# Read train and test files, keeping -1 as it is
train <- fread("train.csv")
test <- fread("test.csv")
# Class outcome for train
outcome <- factor(ifelse(train$target, "Claim", "NoClaim"), levels = c("Claim", "NoClaim"))
# Combining train and test files
comb <- train %>% select(-id,-target) %>% bind_rows(test %>% select(-id))
rm(train,test)
We convert variables that are categorical in nature to type factor. Some categorical variables have only two unique levels; we convert those to integers taking the value 0 or 1.
# Converting categorical variable to factor from integer
comb <- comb %>%
mutate_at(vars(ends_with("cat")), funs(factor))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use list() instead
##
## # Before:
## funs(name = f(.))
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
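As the warning suggests, funs() is deprecated; on dplyr 0.8 the list(~f(.)) form works, and on dplyr 1.0 or later the more idiomatic equivalent uses across(). A sketch of the across() form (an alternative, not run here):
# Equivalent factor conversion without the deprecated funs()
comb <- comb %>% mutate(across(ends_with("cat"), factor))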
# Converting data types of categorical variables that are binary in nature
catv <- apply(comb %>% select(ends_with("cat")),
MARGIN = 2,
function(x){
length(unique(x))
})
comb <- comb %>%
mutate_at(names(catv[catv==2]), funs(as.character)) %>%
mutate_at(names(catv[catv==2]), funs(as.integer))
rm(catv)
# Splitting data set into train and test dataframes
dtrain <- comb[1:length(outcome),]
dtest <- comb[-(1:length(outcome)),]
rm(comb)
nzv <- nearZeroVar(dtrain %>% mutate(target = outcome))
dtrain %>% mutate(target = outcome) %>% select(nzv) %>% colnames()
## [1] "ps_ind_05_cat" "ps_ind_10_bin" "ps_ind_11_bin" "ps_ind_12_bin"
## [5] "ps_ind_13_bin" "ps_ind_14" "ps_reg_03" "ps_car_10_cat"
## [9] "target"
There are 9 near-zero-variance variables, and target itself is among them, a consequence of the severe class imbalance. Since even the outcome is flagged, near-zero variance alone is not a reliable criterion for removal here, so we do not remove any of these variables.
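To see why these variables were flagged, nearZeroVar can also return its underlying diagnostics through the saveMetrics argument; a short sketch:
# Frequency ratio and percent-unique diagnostics for the flagged variables
nzv_metrics <- nearZeroVar(dtrain %>% mutate(target = outcome), saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, c("freqRatio", "percentUnique")]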
Shown below are stacked bar plots and density plots depicting the distributions of the covariates by outcome level.
f1 <- dtrain %>%
ggplot(aes(ps_ind_01, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f2 <- dtrain %>%
ggplot(aes(ps_ind_02_cat, fill = outcome)) +
geom_bar() +
scale_y_log10()
f3 <- dtrain %>%
ggplot(aes(ps_ind_03, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f4 <- dtrain %>%
ggplot(aes(ps_ind_04_cat, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f5 <- dtrain %>%
ggplot(aes(ps_ind_05_cat, fill = outcome)) +
geom_bar() +
scale_y_log10()
f6 <- dtrain %>%
ggplot(aes(ps_ind_06_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f7 <- dtrain %>%
ggplot(aes(ps_ind_07_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f8 <- dtrain %>%
ggplot(aes(ps_ind_08_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f9 <- dtrain %>%
ggplot(aes(ps_ind_09_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f10 <- dtrain %>%
ggplot(aes(ps_ind_10_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f11 <- dtrain %>%
ggplot(aes(ps_ind_11_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f12 <- dtrain %>%
ggplot(aes(ps_ind_12_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f13 <- dtrain %>%
ggplot(aes(ps_ind_13_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f14 <- dtrain %>%
ggplot(aes(ps_ind_14, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f15 <- dtrain %>%
ggplot(aes(ps_ind_15, fill = outcome)) +
geom_bar() +
scale_y_log10()
f16 <- dtrain %>%
ggplot(aes(ps_ind_16_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f17 <- dtrain %>%
ggplot(aes(ps_ind_17_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
f18 <- dtrain %>%
ggplot(aes(ps_ind_18_bin, fill = outcome)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
lay <- rbind(c(1,1,1,2,2),
c(3,3,3,4,4))
grid.arrange(f1, f2, f3, f4, layout_matrix = lay)
lay <- rbind(c(1,1,1,2,3),
c(4,5,6,7,8))
grid.arrange(f5, f6, f7, f8, f9, f10, f11, f12, layout_matrix = lay)
lay <- rbind(c(1,2,2,2,2),
c(3,3,4,5,6))
grid.arrange(f13, f15, f14, f16, f17, f18, layout_matrix = lay)
Apparently none of the levels of the ind variables clearly discerns either outcome level, except that all observations with level 4 of ps_ind_14 belong to NoClaim. But since the proportion of level 4 is relatively small, this finding carries little evidence.
f19 <- dtrain %>%
ggplot(aes(ps_reg_01, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
f20 <- dtrain %>%
ggplot(aes(ps_reg_02, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
f21 <- dtrain %>%
ggplot(aes(ps_reg_03, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05)
grid.arrange(f19, f20, f21, nrow = 3)
The distributions of the reg variables differ between the outcome levels; they may prove useful for discerning the outcome.
f22 <- dtrain %>%
ggplot(aes(ps_car_01_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f23 <- dtrain %>%
ggplot(aes(ps_car_02_cat, fill = outcome)) +
geom_bar() + scale_y_log10()
f24 <- dtrain %>%
ggplot(aes(ps_car_03_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f25 <- dtrain %>%
ggplot(aes(ps_car_04_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f26 <- dtrain %>%
ggplot(aes(ps_car_05_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f27 <- dtrain %>%
ggplot(aes(ps_car_06_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f28 <- dtrain %>%
ggplot(aes(ps_car_07_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f29 <- dtrain %>%
ggplot(aes(ps_car_08_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f30 <- dtrain %>%
ggplot(aes(ps_car_09_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f31 <- dtrain %>%
ggplot(aes(ps_car_10_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f32 <- dtrain %>%
ggplot(aes(ps_car_11_cat, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f33 <- dtrain %>%
ggplot(aes(ps_car_11, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f34 <- dtrain %>%
ggplot(aes(ps_car_12, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f35 <- dtrain %>%
ggplot(aes(ps_car_13, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f36 <- dtrain %>%
ggplot(aes(ps_car_14, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f37 <- dtrain %>%
ggplot(aes(ps_car_15, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
lay <- rbind(c(1,1,1,2,2),
c(3,3,4,4,4))
grid.arrange(f22, f23, f24, f25, layout_matrix = lay)
lay <- rbind(c(1,1,2,2,3),
c(4,4,4,5,5))
grid.arrange(f26, f28, f29, f27, f30, layout_matrix = lay)
lay <- rbind(c(1,1,2,2,NA),
c(3,3,3,3,3))
grid.arrange(f31, f33, f32, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f34, f35, f36, f37, layout_matrix = lay)
We observe that:
The missing values of variables ps_car_02 and ps_car_11 belong to the outcome level NoClaim, but because of their low proportion the evidence is weak.
Though ps_car_11_cat is categorical, it has a whopping 104 levels.
The variables ps_car_13 and ps_car_15 show clearer differences in their distributions across the outcome levels than ps_car_12 and ps_car_14 do.
f38 <- dtrain %>%
ggplot(aes(ps_calc_01, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f39 <- dtrain %>%
ggplot(aes(ps_calc_02, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f40 <- dtrain %>%
ggplot(aes(ps_calc_03, fill = outcome)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
f41 <- dtrain %>%
ggplot(aes(ps_calc_04, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f42 <- dtrain %>%
ggplot(aes(ps_calc_05, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f43 <- dtrain %>%
ggplot(aes(ps_calc_06, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f44 <- dtrain %>%
ggplot(aes(ps_calc_07, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f45 <- dtrain %>%
ggplot(aes(ps_calc_08, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f46 <- dtrain %>%
ggplot(aes(ps_calc_09, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f47 <- dtrain %>%
ggplot(aes(ps_calc_10, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f48 <- dtrain %>%
ggplot(aes(ps_calc_11, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f49 <- dtrain %>%
ggplot(aes(ps_calc_12, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f50 <- dtrain %>%
ggplot(aes(ps_calc_13, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f51 <- dtrain %>%
ggplot(aes(ps_calc_14, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f52 <- dtrain %>%
ggplot(aes(ps_calc_15_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f53 <- dtrain %>%
ggplot(aes(ps_calc_16_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f54 <- dtrain %>%
ggplot(aes(ps_calc_17_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f55 <- dtrain %>%
ggplot(aes(ps_calc_18_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f56 <- dtrain %>%
ggplot(aes(ps_calc_19_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
f57 <- dtrain %>%
ggplot(aes(ps_calc_20_bin, fill = outcome)) +
geom_bar() + scale_y_log10() +
theme(legend.position = "none")
lay <- rbind(c(1,1,2,2,3,3),
c(4,4,4,5,5,5))
grid.arrange(f38, f39, f40, f41, f42, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f43, f44, f45, f46, layout_matrix = lay)
lay <- rbind(c(1,1,2,2),
c(3,3,4,4))
grid.arrange(f47, f48, f49, f50, layout_matrix = lay)
lay <- rbind(c(1,1,1,1,2),
c(3,4,5,6,7))
grid.arrange(f51, f52, f53, f54, f55, f56, f57, layout_matrix = lay)
We observe that almost all levels of the calc variables have the same proportion of outcome levels. Hence it seems the calc variables carry more noise than signal.
We now separate our covariates into categorical and numerical types.
# Numerical variables
numVars <- dtrain %>% select(-ends_with("bin"),-ends_with("cat")) %>% colnames()
# Categorical variables
catVars <- dtrain %>% select(ends_with("bin"),ends_with("cat")) %>% colnames()
A two-tailed Wilcoxon rank sum test, followed by a Benjamini-Hochberg p-value correction, is applied to the numerical covariates. Here the null hypothesis is that the distributions of a covariate for the two outcome levels differ by a location shift of 0.
# Apply Wilcoxon rank sum test on each numerical type covariate
wrst <- apply(dtrain %>% select(numVars), 2,
function(x, y)
{
wxn <- wilcox.test(x ~ y, conf.int = T)[c("statistic", "p.value")]
unlist(wxn)
},
y = ifelse(outcome == "Claim", 1,0))
wrst <- as.data.frame(t(wrst))
names(wrst) <- c("W.Statistic", "W.test_p.value")
wrst$p.adj <- p.adjust(wrst$W.test_p.value , method = "fdr")
wrst$Predictor <- rownames(wrst)
# Scatter plot: each point is a variable, with coordinates given by the
# transformed adjusted p-value and the W statistic
wrst_plot <- wrst %>% mutate(Significance=ifelse(wrst$p.adj<0.01, "p.adj<0.01", "Not Significant"),
W.Statistic_norm = W.Statistic/max(W.Statistic)) %>%
ggplot(aes(x = W.Statistic_norm, y = -log10(p.adj), color = Significance)) +
geom_point(aes(text = (paste("Predictor:", Predictor, "<br>p.adj:", p.adj))), alpha = 0.4, size = 4) +
labs(x = "Normalized W Statistic", y = "-log10(p.adj)")
ggplotly(wrst_plot, tooltip = c("text"))
From the above plot we see that, in this category, ps_reg_03 has high potential to differentiate between the outcome levels, which is also supported by its density plot. All the calc variables in this category show no signal whatsoever.
A one-tailed Pearson's Chi-squared test, followed by a Benjamini-Hochberg p-value correction, is applied to the categorical covariates. Here the null hypothesis is that the frequency distribution of a covariate is the same for both levels of the outcome.
# Function to compute contingency table and Chi squared test statistic with its p-value
tableCalcs <- function(x, y)
{
tab <- table(x, y)
cst <- chisq.test(tab)
out <- c(statistic = cst$statistic,
P = cst$p.value)
}
# Apply Pearson's Chi-squared test on each categorical covariate
chisqt <- apply(dtrain %>% select(catVars), 2, tableCalcs, y = outcome)
chisqt <- as.data.frame(t(chisqt))
names(chisqt) <- c("Chisq.Statistic", "Chisq.test_p.value")
chisqt$p.adj <- p.adjust(chisqt$Chisq.test_p.value , method = "fdr")
chisqt$Predictor <- rownames(chisqt)
# Scatter plot: each point is a variable, with coordinates given by the
# transformed adjusted p-value and the Chi-squared statistic
chisqt_plot <- chisqt %>% mutate(Significance=ifelse(chisqt$p.adj<0.01, "p.adj<0.01", "Not Significant")) %>%
ggplot(aes(x = log(Chisq.Statistic), y = -log10(p.adj), color = Significance)) +
geom_point(aes(text = (paste("Predictor:", Predictor, "<br>p.adj:", p.adj))), alpha = 0.4, size = 4) +
labs(x = "log(Chisq.Statistic)", y = "-log10(p.adj)")
ggplotly(chisqt_plot, tooltip = c("text"))
From the above plot we see that in this category ps_car_11_cat is ranked relatively high compared to the others. Again, none of the calc variables are deemed important by this test. There also appears to be a near-deterministic relation between the transformed adjusted p-value and the Chi-squared statistic, which is unsurprising since the p-value is computed from the statistic.
For both types of covariates we select those whose adjusted p-value is less than 0.01. The categorical covariates are then decomposed into binary indicators by dummification (one-hot encoding).
# Selecting predictors passing the tests
sig_.01_f <- c(wrst %>% filter(p.adj < 0.01) %>% .$Predictor,
chisqt %>% filter(p.adj < 0.01) %>% .$Predictor)
dtrain <- dtrain %>% select(sig_.01_f)
dtest <- dtest %>% select(sig_.01_f)
# One hot encoding categorical covariates
dmy <- dummyVars(" ~ .", data = dtrain)
dtrain <- data.frame(predict(dmy, newdata = dtrain))
dtest <- data.frame(predict(dmy, newdata = dtest))
The dimensions of the train and test data after the above process are:
dim(dtrain)
## [1] 595212 199
dim(dtest)
## [1] 892816 199
An Extreme Gradient Boosting (XGBoost) tree-based method is used for building the models. The competition evaluation metric was the Normalized Gini Coefficient, that is \(2 \times \mathrm{AUC} - 1\), where AUC is the area under the receiver operating characteristic curve. For tuning the models we use AUC as the performance metric together with a 6-fold cross-validation scheme. Two models with different hyperparameter values are built, and their predictions are ensembled by taking their harmonic mean.
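For reference, the metric can be computed directly from predicted probabilities; a minimal sketch, assuming the pROC package is available (normalizedGini is a hypothetical helper, not used elsewhere in this analysis):
# Normalized Gini coefficient = 2 * AUC - 1
normalizedGini <- function(actual, predicted) {
  auc_value <- as.numeric(pROC::auc(actual, predicted, quiet = TRUE))
  2 * auc_value - 1
}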
set.seed(8765)
ps_xgbt_6f_8765 <- train(x = dtrain, y = outcome,
method = "xgbTree",
metric = "ROC",
tuneGrid = expand.grid(nrounds = seq(50,3000,50),
max_depth = 4,
eta = 0.01,
gamma = 0.3,
colsample_bytree = 0.9,
min_child_weight = 0,
subsample = 0.65),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = twoClassSummary,
search = "grid"),
alpha = 10)
ps_xgbt_6f_8765
## eXtreme Gradient Boosting
##
## 595212 samples
## 199 predictor
## 2 classes: 'Claim', 'NoClaim'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
## Summary of sample sizes: 496011, 496010, 496009, 496010, 496009, 496011, ...
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 50 0.6014241 0.000000e+00 1.0000000
## 100 0.6073256 0.000000e+00 1.0000000
## 150 0.6115685 0.000000e+00 1.0000000
## 200 0.6150729 0.000000e+00 1.0000000
## 250 0.6189560 0.000000e+00 1.0000000
## 300 0.6228368 0.000000e+00 1.0000000
## 350 0.6264750 0.000000e+00 1.0000000
## 400 0.6292653 0.000000e+00 1.0000000
## 450 0.6316193 0.000000e+00 1.0000000
## 500 0.6333705 0.000000e+00 1.0000000
## 550 0.6348569 0.000000e+00 1.0000000
## 600 0.6359547 0.000000e+00 1.0000000
## 650 0.6370333 0.000000e+00 1.0000000
## 700 0.6377656 0.000000e+00 1.0000000
## 750 0.6384864 0.000000e+00 1.0000000
## 800 0.6391282 0.000000e+00 1.0000000
## 850 0.6396945 0.000000e+00 1.0000000
## 900 0.6401506 0.000000e+00 1.0000000
## 950 0.6405374 0.000000e+00 1.0000000
## 1000 0.6409591 0.000000e+00 1.0000000
## 1050 0.6413068 0.000000e+00 1.0000000
## 1100 0.6416114 0.000000e+00 1.0000000
## 1150 0.6418509 0.000000e+00 1.0000000
## 1200 0.6420609 0.000000e+00 1.0000000
## 1250 0.6422741 0.000000e+00 1.0000000
## 1300 0.6424910 0.000000e+00 0.9999983
## 1350 0.6426626 4.610420e-05 0.9999983
## 1400 0.6428099 4.610420e-05 0.9999983
## 1450 0.6429623 9.220839e-05 0.9999983
## 1500 0.6431275 9.220839e-05 0.9999983
## 1550 0.6432479 9.220839e-05 0.9999983
## 1600 0.6433807 9.220839e-05 0.9999983
## 1650 0.6434876 9.220839e-05 0.9999983
## 1700 0.6435769 9.220839e-05 0.9999983
## 1750 0.6436743 9.220839e-05 0.9999983
## 1800 0.6437844 9.220839e-05 0.9999983
## 1850 0.6438632 9.220839e-05 0.9999983
## 1900 0.6439749 9.220839e-05 0.9999983
## 1950 0.6440682 9.220839e-05 0.9999983
## 2000 0.6441336 9.220839e-05 0.9999983
## 2050 0.6441769 9.220839e-05 0.9999983
## 2100 0.6442097 9.220839e-05 0.9999983
## 2150 0.6442460 9.220839e-05 0.9999983
## 2200 0.6442260 9.220839e-05 0.9999983
## 2250 0.6442576 9.220839e-05 0.9999983
## 2300 0.6442793 9.220839e-05 0.9999983
## 2350 0.6443322 9.220839e-05 0.9999983
## 2400 0.6443650 9.220839e-05 0.9999983
## 2450 0.6443842 9.220839e-05 0.9999983
## 2500 0.6443695 9.220839e-05 0.9999983
## 2550 0.6443812 9.220839e-05 0.9999983
## 2600 0.6443704 9.220839e-05 0.9999983
## 2650 0.6444024 9.220839e-05 0.9999983
## 2700 0.6444027 9.220839e-05 0.9999983
## 2750 0.6444069 9.220839e-05 0.9999965
## 2800 0.6444012 9.220839e-05 0.9999965
## 2850 0.6443677 9.220839e-05 0.9999965
## 2900 0.6443400 9.220839e-05 0.9999965
## 2950 0.6443642 9.220839e-05 0.9999965
## 3000 0.6443598 9.220839e-05 0.9999965
##
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0.3
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.9
## Tuning parameter 'min_child_weight' was held constant at a value of 0
## Tuning parameter 'subsample' was held constant at a value of 0.65
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 2750, max_depth =
## 4, eta = 0.01, gamma = 0.3, colsample_bytree = 0.9, min_child_weight =
## 0 and subsample = 0.65.
The cross-validated Normalized Gini Coefficient for the above model is 0.2888138.
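This value follows directly from the best cross-validated AUC in the table above:
# Normalized Gini from the model's best cross-validated AUC (nrounds = 2750)
2 * max(ps_xgbt_6f_8765$results$ROC) - 1 # 0.2888138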
set.seed(9876)
ps_xgbt_6f_9876 <- train(x = dtrain, y = outcome,
method = "xgbTree",
metric = "ROC",
tuneGrid = expand.grid(nrounds = seq(50,4000,50),
max_depth = 4,
eta = 0.01,
gamma = 0,
colsample_bytree = 0.9,
min_child_weight = 3,
subsample = 0.5),
trControl = trainControl(method = "cv",
number = 6,
classProbs = TRUE,
summaryFunction = twoClassSummary,
search = "grid"),
alpha = 10)
ps_xgbt_6f_9876
## eXtreme Gradient Boosting
##
## 595212 samples
## 199 predictor
## 2 classes: 'Claim', 'NoClaim'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
## Summary of sample sizes: 496010, 496011, 496009, 496011, 496009, 496010, ...
## Resampling results across tuning parameters:
##
## nrounds ROC Sens Spec
## 50 0.5992751 0.000000e+00 1.0000000
## 100 0.6067106 0.000000e+00 1.0000000
## 150 0.6114146 0.000000e+00 1.0000000
## 200 0.6149008 0.000000e+00 1.0000000
## 250 0.6188799 0.000000e+00 1.0000000
## 300 0.6227925 0.000000e+00 1.0000000
## 350 0.6260832 0.000000e+00 1.0000000
## 400 0.6290621 0.000000e+00 1.0000000
## 450 0.6314908 0.000000e+00 1.0000000
## 500 0.6332994 0.000000e+00 1.0000000
## 550 0.6347740 0.000000e+00 1.0000000
## 600 0.6360976 0.000000e+00 1.0000000
## 650 0.6370692 0.000000e+00 1.0000000
## 700 0.6379984 0.000000e+00 1.0000000
## 750 0.6387514 0.000000e+00 1.0000000
## 800 0.6392878 0.000000e+00 1.0000000
## 850 0.6398697 0.000000e+00 1.0000000
## 900 0.6403674 0.000000e+00 1.0000000
## 950 0.6408321 0.000000e+00 1.0000000
## 1000 0.6411665 0.000000e+00 1.0000000
## 1050 0.6414965 0.000000e+00 1.0000000
## 1100 0.6418160 0.000000e+00 1.0000000
## 1150 0.6421040 0.000000e+00 1.0000000
## 1200 0.6423593 0.000000e+00 1.0000000
## 1250 0.6425700 0.000000e+00 1.0000000
## 1300 0.6427882 0.000000e+00 1.0000000
## 1350 0.6429183 0.000000e+00 1.0000000
## 1400 0.6430525 0.000000e+00 1.0000000
## 1450 0.6432132 0.000000e+00 1.0000000
## 1500 0.6433660 0.000000e+00 1.0000000
## 1550 0.6435067 0.000000e+00 1.0000000
## 1600 0.6436087 0.000000e+00 1.0000000
## 1650 0.6436954 0.000000e+00 1.0000000
## 1700 0.6437715 0.000000e+00 1.0000000
## 1750 0.6438921 0.000000e+00 1.0000000
## 1800 0.6439633 0.000000e+00 1.0000000
## 1850 0.6440385 0.000000e+00 1.0000000
## 1900 0.6441198 0.000000e+00 1.0000000
## 1950 0.6441692 0.000000e+00 1.0000000
## 2000 0.6442315 0.000000e+00 1.0000000
## 2050 0.6442940 0.000000e+00 1.0000000
## 2100 0.6443854 0.000000e+00 1.0000000
## 2150 0.6444275 0.000000e+00 1.0000000
## 2200 0.6444394 0.000000e+00 1.0000000
## 2250 0.6444586 0.000000e+00 1.0000000
## 2300 0.6444788 0.000000e+00 1.0000000
## 2350 0.6445244 0.000000e+00 1.0000000
## 2400 0.6445915 0.000000e+00 1.0000000
## 2450 0.6446100 0.000000e+00 1.0000000
## 2500 0.6446240 0.000000e+00 1.0000000
## 2550 0.6446438 0.000000e+00 1.0000000
## 2600 0.6446664 0.000000e+00 0.9999983
## 2650 0.6446694 0.000000e+00 0.9999983
## 2700 0.6446783 0.000000e+00 0.9999983
## 2750 0.6447230 0.000000e+00 0.9999983
## 2800 0.6447030 9.219564e-05 0.9999983
## 2850 0.6447186 9.219564e-05 0.9999983
## 2900 0.6447465 9.219564e-05 0.9999983
## 2950 0.6447184 9.219564e-05 0.9999983
## 3000 0.6447291 9.219564e-05 0.9999983
## 3050 0.6447476 9.219564e-05 0.9999983
## 3100 0.6447402 1.382871e-04 0.9999983
## 3150 0.6447525 9.219564e-05 0.9999983
## 3200 0.6447079 1.843785e-04 0.9999983
## 3250 0.6446704 1.843785e-04 0.9999983
## 3300 0.6446563 1.843785e-04 0.9999983
## 3350 0.6446641 1.382871e-04 0.9999983
## 3400 0.6446431 9.219564e-05 0.9999983
## 3450 0.6446492 9.219564e-05 0.9999983
## 3500 0.6446198 1.382871e-04 0.9999983
## 3550 0.6446332 1.382871e-04 0.9999983
## 3600 0.6446014 1.843785e-04 0.9999983
## 3650 0.6445874 1.382871e-04 0.9999983
## 3700 0.6445727 1.843785e-04 0.9999983
## 3750 0.6445566 1.382871e-04 0.9999983
## 3800 0.6445549 1.382871e-04 0.9999983
## 3850 0.6445344 1.382871e-04 0.9999983
## 3900 0.6445038 1.382871e-04 0.9999983
## 3950 0.6444976 1.843785e-04 0.9999983
## 4000 0.6444753 1.843785e-04 0.9999983
##
## Tuning parameter 'max_depth' was held constant at a value of 4
## Tuning parameter 'eta' was held constant at a value of 0.01
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.9
## Tuning parameter 'min_child_weight' was held constant at a value of 3
## Tuning parameter 'subsample' was held constant at a value of 0.5
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 3150, max_depth =
## 4, eta = 0.01, gamma = 0, colsample_bytree = 0.9, min_child_weight =
## 3 and subsample = 0.5.
The cross-validated Normalized Gini Coefficient for the above model is 0.289505.
We now take the harmonic mean of the predictions of the above two models and write the submission files. Since the harmonic mean \(2/(1/p_1 + 1/p_2)\) of two probabilities is pulled toward the smaller one, the ensemble is slightly conservative; we get a small improvement in the performance metric using it.
# Model #1 predictions
p_8765 <- predict(ps_xgbt_6f_8765, dtest, type = "prob")[,"Claim"]
sub_ps_xgbt_6f_8765 <- fread("sample_submission.csv") %>% mutate(target = p_8765)
write_excel_csv(sub_ps_xgbt_6f_8765, "sub_ps_xgbt_6f_8765.csv")
# Model #2 predictions
p_9876 <- predict(ps_xgbt_6f_9876, dtest, type = "prob")[,"Claim"]
sub_ps_xgbt_6f_9876 <- fread("sample_submission.csv") %>% mutate(target = p_9876)
write_excel_csv(sub_ps_xgbt_6f_9876, "sub_ps_xgbt_6f_9876.csv")
# Harmonic mean of predictions
hm2 <- 1/((1/p_8765 + 1/p_9876)/2)
sub_ps_xgbt_hm2 <- fread("sample_submission.csv") %>% mutate(target = hm2)
write_excel_csv(sub_ps_xgbt_hm2, "sub_ps_xgbt_hm2.csv")
A snapshot of the private leaderboard scores is shown below.
The calc variables were entirely eliminated by the selection process, which suggests that they would only contribute noise if added to the feature set. Since the dataset was imbalanced, AUC was used as the performance metric for tuning the hyperparameters of the xgboost models. It is apparent from the private leaderboard scores that the given dataset was hard to learn from. The solution proposed in this document improves on the \(35^{th}\)-place solution on the private leaderboard, placing it among the top 0.7% of solutions.