In this challenge, BNP Paribas Cardif provides an anonymized database with two categories of claims. Kagglers are challenged to predict the category of a claim based on features available early in the process, helping BNP Paribas Cardif accelerate its claims process and therefore provide better service to its customers. For more competition details, see:
https://www.kaggle.com/c/bnp-paribas-cardif-claims-management.
After taking a quick look at the data, we can see there are a lot of NAs.
library(data.table)
trainingData <- fread("train.csv", sep= ',', showProgress=TRUE, data.table=FALSE)
## Read 114321 rows and 134 (of 134) columns from 0.093 GB file in 00:00:06
summary(trainingData[,1:12])
## ID target v1 v2
## Min. : 3 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 57280 1st Qu.:1.0000 1st Qu.: 0.91 1st Qu.: 5.32
## Median :114189 Median :1.0000 Median : 1.47 Median : 7.02
## Mean :114229 Mean :0.7612 Mean : 1.63 Mean : 7.46
## 3rd Qu.:171206 3rd Qu.:1.0000 3rd Qu.: 2.14 3rd Qu.: 9.47
## Max. :228713 Max. :1.0000 Max. :20.00 Max. :20.00
## NA's :49832 NA's :49796
## v3 v4 v5 v6
## Length:114321 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 3.49 1st Qu.: 7.61 1st Qu.: 2.07
## Mode :character Median : 4.21 Median : 8.67 Median : 2.41
## Mean : 4.15 Mean : 8.74 Mean : 2.44
## 3rd Qu.: 4.83 3rd Qu.: 9.77 3rd Qu.: 2.78
## Max. :20.00 Max. :20.00 Max. :20.00
## NA's :49796 NA's :48624 NA's :49832
## v7 v8 v9 v10
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.10 1st Qu.: 0.09 1st Qu.: 7.85 1st Qu.: 1.050
## Median : 2.45 Median : 0.39 Median : 9.06 Median : 1.313
## Mean : 2.48 Mean : 1.50 Mean : 9.03 Mean : 1.883
## 3rd Qu.: 2.83 3rd Qu.: 1.63 3rd Qu.:10.23 3rd Qu.: 2.101
## Max. :20.00 Max. :20.00 Max. :20.00 Max. :18.534
## NA's :49832 NA's :48619 NA's :49851 NA's :84
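The summary above only covers the first 12 columns. As a quick check (not part of the original output), the missingness across all 134 columns can be tallied directly:
naCounts <- colSums(is.na(trainingData))  # NAs per column
sum(naCounts > 0)                         # how many columns contain NAs
head(sort(naCounts, decreasing = TRUE))   # the most incomplete columns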
Luckily though, the NAs appear to be missing systematically, and I suspect this may be because multiple types of claims are combined into one data set: perhaps different coverage types, or even different product lines. Whatever the cause, I suspect there is a good chance that trying to fit this as one entire data set will lead to weaker results. I’ve already created an indicator column–v132NAGroup–that tells us which set of data each observation belongs to.
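For reference, a grouping like this can be derived from the missingness pattern itself. The sketch below is not the exact code used to build v132NAGroup; it keys on a couple of hypothetical marker columns (v1 and v8 here) so that sporadic NAs don't fragment the groups, and the letter assignment will not necessarily match the labels already in the data:
### Sketch only -- marker columns and letter labels are assumptions
markerCols <- c("v1", "v8")
pattern <- apply(is.na(trainingData[, markerCols]), 1, paste, collapse = "")
trainingData$naGroupSketch <- LETTERS[as.integer(factor(pattern))]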
Another quick look, this time at the first 25 rows:
head(trainingData[,c(1:7,134)], n=25)
## ID target v1 v2 v3 v4 v5 v132NAGroup
## 1 3 1 1.3357394 8.727474 C 3.921026 7.915266 C
## 2 4 1 NA NA C NA 9.191265 B
## 3 5 1 0.9438769 5.310079 C 4.410969 5.326159 C
## 4 6 1 0.7974146 8.304757 C 4.225930 11.627438 C
## 5 8 1 NA NA C NA NA A
## 6 9 0 NA NA C NA 8.856791 B
## 7 12 0 0.8998057 7.312995 C 3.494148 9.946200 C
## 8 21 1 NA NA C NA NA A
## 9 22 0 2.0786513 8.462619 3.739030 5.265636 C
## 10 23 1 1.1448024 5.880606 C 3.244469 9.538384 C
## 11 24 1 NA NA C NA NA A
## 12 27 1 NA NA C NA NA A
## 13 28 0 NA NA C NA NA A
## 14 30 1 1.4002669 5.367204 C 4.122155 8.137188 C
## 15 31 1 2.2600357 14.693263 C 5.150750 8.554136 C
## 16 32 1 NA NA C NA NA A
## 17 33 1 0.6228961 7.024732 C 4.193688 6.288177 C
## 18 34 1 NA NA C NA NA A
## 19 35 1 NA NA C NA NA A
## 20 36 1 NA NA C NA NA A
## 21 37 1 0.9438780 5.927194 C 4.404372 9.045057 C
## 22 39 1 1.2898409 4.788645 C 4.283417 10.719571 C
## 23 40 1 0.7288239 4.073244 C 4.130054 9.032563 C
## 24 42 1 3.9445628 5.718516 C 2.205080 5.340648 C
## 25 43 1 4.0457254 3.992607 C 3.598096 7.946330 C
Now it’s time to separate the data sets and see whether most of our NAs can be eliminated.
### Separate Data Sets
trainingDataA <- trainingData[trainingData[,134]=="A",] #34 Columns Populated
trainingDataB <- trainingData[trainingData[,134]=="B",] #53 Columns Populated
trainingDataD <- trainingData[trainingData[,134]=="D",] #70 Columns Populated
trainingDataC <- trainingData[trainingData[,134]=="C",] #134 Columns Populated
### Remove NA Columns
trainingDataA <- trainingDataA[ ,colSums(is.na(trainingDataA))<nrow(trainingDataA)]
trainingDataB <- trainingDataB[ ,colSums(is.na(trainingDataB))<nrow(trainingDataB)]
trainingDataD <- trainingDataD[ ,colSums(is.na(trainingDataD))<nrow(trainingDataD)]
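Equivalently, the split-and-drop could be written more compactly. This is just an alternative sketch, assuming the group column is the 134th:
dropAllNA <- function(df) df[, colSums(is.na(df)) < nrow(df)]  # drop columns that are entirely NA
trainingBySet <- lapply(split(trainingData, trainingData[, 134]), dropAllNA)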
Let’s look at the new dimensions of our data:
dataDimensions <- as.data.frame(rbind(dim(trainingDataA),dim(trainingDataB),dim(trainingDataC),dim(trainingDataD)))
names(dataDimensions) <- c("Observations", "Predictors")
dataDimensions
## Observations Predictors
## 1 47745 34
## 2 2051 53
## 3 64489 134
## 4 36 70
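As a quick sanity check (not shown in the original post), the four subsets should together account for every row of the original training data:
sum(dataDimensions$Observations)  # 47745 + 2051 + 64489 + 36 = 114321
nrow(trainingData)                # 114321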
We should now compare our new data to our initial data summary:
summary(trainingDataA[,1:12])
## ID target v3 v10
## Min. : 8 Min. :0.0000 Length:47745 Min. : 0.000
## 1st Qu.: 57555 1st Qu.:1.0000 Class :character 1st Qu.: 1.050
## Median :114044 Median :1.0000 Mode :character Median : 1.313
## Mean :114291 Mean :0.7715 Mean : 1.840
## 3rd Qu.:171265 3rd Qu.:1.0000 3rd Qu.: 2.101
## Max. :228710 Max. :1.0000 Max. :14.158
## NA's :68
## v12 v14 v21 v22
## Min. : 0.000 Min. :-0.000001 Min. : 0.1178 Length:47745
## 1st Qu.: 6.298 1st Qu.:11.160338 1st Qu.: 6.3602 Class :character
## Median : 6.582 Median :11.866709 Median : 7.0196 Mode :character
## Mean : 6.838 Mean :12.010971 Mean : 6.9840
## 3rd Qu.: 6.970 3rd Qu.:12.677570 3rd Qu.: 7.6572
## Max. :16.303 Max. :17.879486 Max. :19.2961
## NA's :70 NA's :3 NA's :281
## v24 v30 v31 v34
## Length:47745 Length:47745 Length:47745 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 5.197
## Mode :character Mode :character Mode :character Median : 6.702
## Mean : 6.616
## 3rd Qu.: 7.988
## Max. :16.329
## NA's :95
summary(trainingDataB[,1:12])
## ID target v3 v5
## Min. : 4 Min. :0.0000 Length:2051 Min. : 2.487
## 1st Qu.: 58459 1st Qu.:0.0000 Class :character 1st Qu.: 7.930
## Median :118860 Median :1.0000 Mode :character Median : 8.749
## Mean :116422 Mean :0.7392 Mean : 8.664
## 3rd Qu.:173413 3rd Qu.:1.0000 3rd Qu.: 9.509
## Max. :228712 Max. :1.0000 Max. :11.143
##
## v8 v10 v12 v14
## Min. : 0.1714 Min. :-0.000001 Min. : 3.719 Min. : 0.00
## 1st Qu.: 0.3470 1st Qu.: 1.050328 1st Qu.: 6.330 1st Qu.:11.22
## Median : 0.5678 Median : 1.312910 Median : 6.613 Median :11.87
## Mean : 1.2486 Mean : 1.809336 Mean : 6.865 Mean :12.06
## 3rd Qu.: 1.3045 3rd Qu.: 1.838075 3rd Qu.: 6.979 3rd Qu.:12.63
## Max. :20.0000 Max. : 7.877461 Max. :11.389 Max. :17.45
## NA's :2 NA's :2
## v21 v22 v24 v25
## Min. : 0.8763 Length:2051 Length:2051 Min. : 0.1012
## 1st Qu.: 6.4379 Class :character Class :character 1st Qu.: 0.3932
## Median : 7.0061 Mode :character Mode :character Median : 0.6754
## Mean : 7.0250 Mean : 1.4128
## 3rd Qu.: 7.6666 3rd Qu.: 1.4776
## Max. :11.6295 Max. :20.0000
## NA's :6
summary(trainingDataC[,1:12])
## ID target v1 v2
## Min. : 3 Min. :0.0000 Min. :-0.000001 Min. :-0.000001
## 1st Qu.: 57060 1st Qu.:1.0000 1st Qu.: 0.913580 1st Qu.: 5.318110
## Median :114182 Median :1.0000 Median : 1.469550 Median : 7.024732
## Mean :114119 Mean :0.7543 Mean : 1.630686 Mean : 7.465010
## 3rd Qu.:171076 3rd Qu.:1.0000 3rd Qu.: 2.136128 3rd Qu.: 9.467317
## Max. :228713 Max. :1.0000 Max. :20.000001 Max. :20.000000
##
## v3 v4 v5
## Length:64489 Min. :-0.000001 Min. : 0.000
## Class :character 1st Qu.: 3.487870 1st Qu.: 7.590
## Mode :character Median : 4.206241 Median : 8.663
## Mean : 4.145514 Mean : 8.744
## 3rd Qu.: 4.833251 3rd Qu.: 9.782
## Max. :20.000000 Max. :20.000
## NA's :879
## v6 v7 v8
## Min. :-0.000001 Min. :-0.000001 Min. : 0.0000
## 1st Qu.: 2.065064 1st Qu.: 2.101477 1st Qu.: 0.0820
## Median : 2.412790 Median : 2.452166 Median : 0.3737
## Mean : 2.436402 Mean : 2.483921 Mean : 1.5049
## 3rd Qu.: 2.775285 3rd Qu.: 2.834285 3rd Qu.: 1.6329
## Max. :20.000001 Max. :20.000000 Max. :20.0000
## NA's :874
## v9 v10
## Min. :-0.000001 Min. :-0.000001
## 1st Qu.: 7.853659 1st Qu.: 1.050328
## Median : 9.059582 Median : 1.312910
## Mean : 9.031859 Mean : 1.917446
## 3rd Qu.:10.232559 3rd Qu.: 2.253830
## Max. :20.000001 Max. :18.533916
## NA's :19 NA's :14
summary(trainingDataD[,1:12])
## ID target v2 v3
## Min. : 958 Min. :0.0000 Min. : 4.528 Length:36
## 1st Qu.: 42538 1st Qu.:0.0000 1st Qu.: 4.528 Class :character
## Median :110316 Median :1.0000 Median : 4.741 Mode :character
## Mean :104898 Mean :0.7222 Mean : 6.391
## 3rd Qu.:157751 3rd Qu.:1.0000 3rd Qu.: 7.213
## Max. :220744 Max. :1.0000 Max. :12.872
## v4 v5 v8 v10
## Min. :2.318 Min. : 8.006 Min. :0.2966 Min. :0.7659
## 1st Qu.:2.318 1st Qu.: 9.692 1st Qu.:0.4145 1st Qu.:1.0503
## Median :3.232 Median : 9.692 Median :0.4790 Median :1.3129
## Mean :3.398 Mean : 9.629 Mean :0.8251 Mean :1.5433
## 3rd Qu.:4.205 3rd Qu.:10.373 3rd Qu.:0.8868 3rd Qu.:1.5755
## Max. :5.918 Max. :10.373 Max. :2.3723 Max. :4.7046
## v12 v14 v17 v21
## Min. :6.005 Min. : 9.435 Min. :2.304 Min. :4.550
## 1st Qu.:6.331 1st Qu.:11.377 1st Qu.:2.554 1st Qu.:6.513
## Median :6.541 Median :12.127 Median :2.554 Median :7.043
## Mean :6.619 Mean :12.105 Mean :3.538 Mean :7.093
## 3rd Qu.:6.814 3rd Qu.:12.869 3rd Qu.:4.578 3rd Qu.:7.515
## Max. :7.704 Max. :15.356 Max. :8.606 Max. :9.599
Obviously there are still some NAs in our data, but these are probably explained by typical data-entry or measurement errors rather than systematically missing data. We could simply omit those observations from our training data to keep the analysis simpler, or, if we want to improve the fit and don’t want to lose the data, attempt some imputation methods.
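As a minimal sketch of either route (not steps actually taken in this post), applied to the first subset:
### Option 1: drop rows with any remaining NAs
trainingDataAComplete <- na.omit(trainingDataA)
### Option 2: crude median imputation for remaining numeric NAs
### (a placeholder for a proper method such as mice or missForest)
numericCols <- sapply(trainingDataA, is.numeric)
trainingDataAImputed <- trainingDataA
trainingDataAImputed[numericCols] <- lapply(
  trainingDataAImputed[numericCols],
  function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }
)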
Something you may have noticed in the previous summary() outputs is that many of the continuous quantitative variables had a max of 20! This is highly suspect, and I imagine it tells us that Cardif already did some partial data ‘normalization’, possibly using 20.0000 as an “unknown” value. Check out the histograms below and see for yourself:
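(The histogram figure itself isn’t reproduced here, but a sketch along these lines would show whether values pile up at 20; the choice of v1 and v2 is purely illustrative:)
par(mfrow = c(1, 2))
hist(trainingDataC$v1, breaks = 50, main = "v1", xlab = "v1")
hist(trainingDataC$v2, breaks = 50, main = "v2", xlab = "v2")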
I haven’t yet decided how I want to handle the weird maximums in the data, but I think that’s the next step.