In this challenge, BNP Paribas Cardif is providing an anonymized database with two categories of claims:

  1. claims for which approval could be accelerated leading to faster payments
  2. claims for which additional information is required before approval

Kagglers are challenged to predict the category of a claim based on features available early in the process, helping BNP Paribas Cardif accelerate its claims process and therefore provide a better service to its customers. For more competition details, see:

https://www.kaggle.com/c/bnp-paribas-cardif-claims-management

Data Cleaning

Separation of Data Sets

After taking a quick look at the data, we can see there are a lot of NAs.

library(data.table)
trainingData <- fread("train.csv", sep= ',', showProgress=TRUE, data.table=FALSE)
## 
Read 35.0% of 114321 rows
Read 61.2% of 114321 rows
Read 87.5% of 114321 rows
Read 114321 rows and 134 (of 134) columns from 0.093 GB file in 00:00:06
summary(trainingData[,1:12])
##        ID             target             v1              v2       
##  Min.   :     3   Min.   :0.0000   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 57280   1st Qu.:1.0000   1st Qu.: 0.91   1st Qu.: 5.32  
##  Median :114189   Median :1.0000   Median : 1.47   Median : 7.02  
##  Mean   :114229   Mean   :0.7612   Mean   : 1.63   Mean   : 7.46  
##  3rd Qu.:171206   3rd Qu.:1.0000   3rd Qu.: 2.14   3rd Qu.: 9.47  
##  Max.   :228713   Max.   :1.0000   Max.   :20.00   Max.   :20.00  
##                                    NA's   :49832   NA's   :49796  
##       v3                  v4              v5              v6       
##  Length:114321      Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  Class :character   1st Qu.: 3.49   1st Qu.: 7.61   1st Qu.: 2.07  
##  Mode  :character   Median : 4.21   Median : 8.67   Median : 2.41  
##                     Mean   : 4.15   Mean   : 8.74   Mean   : 2.44  
##                     3rd Qu.: 4.83   3rd Qu.: 9.77   3rd Qu.: 2.78  
##                     Max.   :20.00   Max.   :20.00   Max.   :20.00  
##                     NA's   :49796   NA's   :48624   NA's   :49832  
##        v7              v8              v9             v10        
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 2.10   1st Qu.: 0.09   1st Qu.: 7.85   1st Qu.: 1.050  
##  Median : 2.45   Median : 0.39   Median : 9.06   Median : 1.313  
##  Mean   : 2.48   Mean   : 1.50   Mean   : 9.03   Mean   : 1.883  
##  3rd Qu.: 2.83   3rd Qu.: 1.63   3rd Qu.:10.23   3rd Qu.: 2.101  
##  Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :18.534  
##  NA's   :49832   NA's   :48619   NA's   :49851   NA's   :84
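
A quick tally of the missing values per column makes the scale of the problem concrete. This is a small sketch of my own, not part of the original workflow:

### My own quick tally: NAs per column, ten worst offenders first
naCounts <- sort(colSums(is.na(trainingData)), decreasing=TRUE)
head(naCounts, n=10)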

Luckily though, the NAs appear to be the result of data missing systematically, and I suspect this may be because there are multiple types of claims combined into one data set: maybe different coverage types, or even product lines. Whatever the cause, I suspect there is a good chance that fitting this as one entire data set will lead to weaker results. I’ve already created an indicator column, v132NAGroup, that tells us which set of data each observation belongs to.
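
For anyone curious how such an indicator could be built, the sketch below derives a group label from the NA pattern of a couple of telltale columns. The choice of v1 and v5 is just for illustration from the excerpt below, and is not necessarily how I built v132NAGroup; in particular, the fourth group, D, would need an additional telltale column to distinguish it from B:

### Illustrative only: derive a group label from the NA pattern of a few
### columns (v1 and v5 chosen for illustration; group D needs another column)
naGroup <- ifelse(!is.na(trainingData$v1), "C",
           ifelse(!is.na(trainingData$v5), "B", "A"))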

Another quick look, this time at the first 25 rows:

head(trainingData[,c(1:7,134)], n=25)
##    ID target        v1        v2 v3       v4        v5 v132NAGroup
## 1   3      1 1.3357394  8.727474  C 3.921026  7.915266           C
## 2   4      1        NA        NA  C       NA  9.191265           B
## 3   5      1 0.9438769  5.310079  C 4.410969  5.326159           C
## 4   6      1 0.7974146  8.304757  C 4.225930 11.627438           C
## 5   8      1        NA        NA  C       NA        NA           A
## 6   9      0        NA        NA  C       NA  8.856791           B
## 7  12      0 0.8998057  7.312995  C 3.494148  9.946200           C
## 8  21      1        NA        NA  C       NA        NA           A
## 9  22      0 2.0786513  8.462619    3.739030  5.265636           C
## 10 23      1 1.1448024  5.880606  C 3.244469  9.538384           C
## 11 24      1        NA        NA  C       NA        NA           A
## 12 27      1        NA        NA  C       NA        NA           A
## 13 28      0        NA        NA  C       NA        NA           A
## 14 30      1 1.4002669  5.367204  C 4.122155  8.137188           C
## 15 31      1 2.2600357 14.693263  C 5.150750  8.554136           C
## 16 32      1        NA        NA  C       NA        NA           A
## 17 33      1 0.6228961  7.024732  C 4.193688  6.288177           C
## 18 34      1        NA        NA  C       NA        NA           A
## 19 35      1        NA        NA  C       NA        NA           A
## 20 36      1        NA        NA  C       NA        NA           A
## 21 37      1 0.9438780  5.927194  C 4.404372  9.045057           C
## 22 39      1 1.2898409  4.788645  C 4.283417 10.719571           C
## 23 40      1 0.7288239  4.073244  C 4.130054  9.032563           C
## 24 42      1 3.9445628  5.718516  C 2.205080  5.340648           C
## 25 43      1 4.0457254  3.992607  C 3.598096  7.946330           C

Now it’s time to separate the data sets and see whether most of our NAs can be eliminated.

###Separate Data Sets
trainingDataA <- trainingData[trainingData[,134]=="A",] #34 Columns Populated
trainingDataB <- trainingData[trainingData[,134]=="B",] #53 Columns Populated
trainingDataD <- trainingData[trainingData[,134]=="D",] #70 Columns Populated
trainingDataC <- trainingData[trainingData[,134]=="C",] #134 Columns Populated
###Remove Columns That Are Entirely NA Within Each Subset
trainingDataA <- trainingDataA[ ,colSums(is.na(trainingDataA))<nrow(trainingDataA)]
trainingDataB <- trainingDataB[ ,colSums(is.na(trainingDataB))<nrow(trainingDataB)]
trainingDataD <- trainingDataD[ ,colSums(is.na(trainingDataD))<nrow(trainingDataD)]

Let’s look at the new dimensions of our data:

dataDimensions <- as.data.frame(rbind(dim(trainingDataA),dim(trainingDataB),dim(trainingDataC),dim(trainingDataD)))
names(dataDimensions) <- c("Observations", "Predictors")
dataDimensions
##   Observations Predictors
## 1        47745         34
## 2         2051         53
## 3        64489        134
## 4           36         70
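
As a quick sanity check (my own addition, not part of the original analysis), the four subsets should account for every row of the original data:

### Sanity check: the four subsets partition the full training set
sum(dataDimensions$Observations) == nrow(trainingData)
## [1] TRUE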

We should now compare our new data to our initial data summary:

summary(trainingDataA[,1:12])
##        ID             target            v3                 v10        
##  Min.   :     8   Min.   :0.0000   Length:47745       Min.   : 0.000  
##  1st Qu.: 57555   1st Qu.:1.0000   Class :character   1st Qu.: 1.050  
##  Median :114044   Median :1.0000   Mode  :character   Median : 1.313  
##  Mean   :114291   Mean   :0.7715                      Mean   : 1.840  
##  3rd Qu.:171265   3rd Qu.:1.0000                      3rd Qu.: 2.101  
##  Max.   :228710   Max.   :1.0000                      Max.   :14.158  
##                                                       NA's   :68      
##       v12              v14                 v21              v22           
##  Min.   : 0.000   Min.   :-0.000001   Min.   : 0.1178   Length:47745      
##  1st Qu.: 6.298   1st Qu.:11.160338   1st Qu.: 6.3602   Class :character  
##  Median : 6.582   Median :11.866709   Median : 7.0196   Mode  :character  
##  Mean   : 6.838   Mean   :12.010971   Mean   : 6.9840                     
##  3rd Qu.: 6.970   3rd Qu.:12.677570   3rd Qu.: 7.6572                     
##  Max.   :16.303   Max.   :17.879486   Max.   :19.2961                     
##  NA's   :70       NA's   :3           NA's   :281                         
##      v24                v30                v31                 v34        
##  Length:47745       Length:47745       Length:47745       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.197  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.702  
##                                                           Mean   : 6.616  
##                                                           3rd Qu.: 7.988  
##                                                           Max.   :16.329  
##                                                           NA's   :95
summary(trainingDataB[,1:12])
##        ID             target            v3                  v5        
##  Min.   :     4   Min.   :0.0000   Length:2051        Min.   : 2.487  
##  1st Qu.: 58459   1st Qu.:0.0000   Class :character   1st Qu.: 7.930  
##  Median :118860   Median :1.0000   Mode  :character   Median : 8.749  
##  Mean   :116422   Mean   :0.7392                      Mean   : 8.664  
##  3rd Qu.:173413   3rd Qu.:1.0000                      3rd Qu.: 9.509  
##  Max.   :228712   Max.   :1.0000                      Max.   :11.143  
##                                                                       
##        v8               v10                 v12              v14       
##  Min.   : 0.1714   Min.   :-0.000001   Min.   : 3.719   Min.   : 0.00  
##  1st Qu.: 0.3470   1st Qu.: 1.050328   1st Qu.: 6.330   1st Qu.:11.22  
##  Median : 0.5678   Median : 1.312910   Median : 6.613   Median :11.87  
##  Mean   : 1.2486   Mean   : 1.809336   Mean   : 6.865   Mean   :12.06  
##  3rd Qu.: 1.3045   3rd Qu.: 1.838075   3rd Qu.: 6.979   3rd Qu.:12.63  
##  Max.   :20.0000   Max.   : 7.877461   Max.   :11.389   Max.   :17.45  
##                    NA's   :2           NA's   :2                       
##       v21              v22                v24                 v25         
##  Min.   : 0.8763   Length:2051        Length:2051        Min.   : 0.1012  
##  1st Qu.: 6.4379   Class :character   Class :character   1st Qu.: 0.3932  
##  Median : 7.0061   Mode  :character   Mode  :character   Median : 0.6754  
##  Mean   : 7.0250                                         Mean   : 1.4128  
##  3rd Qu.: 7.6666                                         3rd Qu.: 1.4776  
##  Max.   :11.6295                                         Max.   :20.0000  
##  NA's   :6
summary(trainingDataC[,1:12])
##        ID             target             v1                  v2           
##  Min.   :     3   Min.   :0.0000   Min.   :-0.000001   Min.   :-0.000001  
##  1st Qu.: 57060   1st Qu.:1.0000   1st Qu.: 0.913580   1st Qu.: 5.318110  
##  Median :114182   Median :1.0000   Median : 1.469550   Median : 7.024732  
##  Mean   :114119   Mean   :0.7543   Mean   : 1.630686   Mean   : 7.465010  
##  3rd Qu.:171076   3rd Qu.:1.0000   3rd Qu.: 2.136128   3rd Qu.: 9.467317  
##  Max.   :228713   Max.   :1.0000   Max.   :20.000001   Max.   :20.000000  
##                                                                           
##       v3                  v4                  v5        
##  Length:64489       Min.   :-0.000001   Min.   : 0.000  
##  Class :character   1st Qu.: 3.487870   1st Qu.: 7.590  
##  Mode  :character   Median : 4.206241   Median : 8.663  
##                     Mean   : 4.145514   Mean   : 8.744  
##                     3rd Qu.: 4.833251   3rd Qu.: 9.782  
##                     Max.   :20.000000   Max.   :20.000  
##                                         NA's   :879     
##        v6                  v7                  v8         
##  Min.   :-0.000001   Min.   :-0.000001   Min.   : 0.0000  
##  1st Qu.: 2.065064   1st Qu.: 2.101477   1st Qu.: 0.0820  
##  Median : 2.412790   Median : 2.452166   Median : 0.3737  
##  Mean   : 2.436402   Mean   : 2.483921   Mean   : 1.5049  
##  3rd Qu.: 2.775285   3rd Qu.: 2.834285   3rd Qu.: 1.6329  
##  Max.   :20.000001   Max.   :20.000000   Max.   :20.0000  
##                                          NA's   :874      
##        v9                 v10           
##  Min.   :-0.000001   Min.   :-0.000001  
##  1st Qu.: 7.853659   1st Qu.: 1.050328  
##  Median : 9.059582   Median : 1.312910  
##  Mean   : 9.031859   Mean   : 1.917446  
##  3rd Qu.:10.232559   3rd Qu.: 2.253830  
##  Max.   :20.000001   Max.   :18.533916  
##  NA's   :19          NA's   :14
summary(trainingDataD[,1:12])
##        ID             target             v2              v3           
##  Min.   :   958   Min.   :0.0000   Min.   : 4.528   Length:36         
##  1st Qu.: 42538   1st Qu.:0.0000   1st Qu.: 4.528   Class :character  
##  Median :110316   Median :1.0000   Median : 4.741   Mode  :character  
##  Mean   :104898   Mean   :0.7222   Mean   : 6.391                     
##  3rd Qu.:157751   3rd Qu.:1.0000   3rd Qu.: 7.213                     
##  Max.   :220744   Max.   :1.0000   Max.   :12.872                     
##        v4              v5               v8              v10        
##  Min.   :2.318   Min.   : 8.006   Min.   :0.2966   Min.   :0.7659  
##  1st Qu.:2.318   1st Qu.: 9.692   1st Qu.:0.4145   1st Qu.:1.0503  
##  Median :3.232   Median : 9.692   Median :0.4790   Median :1.3129  
##  Mean   :3.398   Mean   : 9.629   Mean   :0.8251   Mean   :1.5433  
##  3rd Qu.:4.205   3rd Qu.:10.373   3rd Qu.:0.8868   3rd Qu.:1.5755  
##  Max.   :5.918   Max.   :10.373   Max.   :2.3723   Max.   :4.7046  
##       v12             v14              v17             v21       
##  Min.   :6.005   Min.   : 9.435   Min.   :2.304   Min.   :4.550  
##  1st Qu.:6.331   1st Qu.:11.377   1st Qu.:2.554   1st Qu.:6.513  
##  Median :6.541   Median :12.127   Median :2.554   Median :7.043  
##  Mean   :6.619   Mean   :12.105   Mean   :3.538   Mean   :7.093  
##  3rd Qu.:6.814   3rd Qu.:12.869   3rd Qu.:4.578   3rd Qu.:7.515  
##  Max.   :7.704   Max.   :15.356   Max.   :8.606   Max.   :9.599

There are still some NAs in our data, but these are probably the result of typical data-entry or measurement errors rather than systematically missing data. We could simply omit those observations from our training data to keep the analysis simpler, or, if we don’t want to lose the data and think it could improve our fit, attempt some imputation methods.
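
As a minimal sketch of both options, using the group-A subset as an example (the median-imputation loop is my own illustration, not a settled choice):

### Option 1: drop any rows that still contain NAs (simple, but loses data)
trainingDataAComplete <- na.omit(trainingDataA)
### Option 2: fill remaining numeric NAs with the column median
for (col in names(trainingDataA)) {
  if (is.numeric(trainingDataA[[col]]) && anyNA(trainingDataA[[col]])) {
    colMedian <- median(trainingDataA[[col]], na.rm=TRUE)
    trainingDataA[[col]][is.na(trainingDataA[[col]])] <- colMedian
  }
}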

Understanding and Visualizing the Data

Something you may have noticed in the previous summary() outputs is that many of the continuous quantitative variables have a maximum of exactly 20! This is highly suspect, and I imagine it tells us that Cardif already did some partial ‘normalization’ of the data, possibly using 20.0000 as an “unknown” value. Check out the histograms below and see for yourself:

Histograms of Data
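
The spikes at 20 in the histograms are easy to confirm numerically. As a quick check of my own, this counts how many numeric columns in the group-C subset top out at (roughly) 20:

### My own check: how many numeric columns max out at ~20?
numericCols <- sapply(trainingDataC, is.numeric)
colMaxes <- sapply(trainingDataC[, numericCols], max, na.rm=TRUE)
sum(abs(colMaxes - 20) < 0.001)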

I’ve not yet decided how I want to handle the weird maximums in the data, but I think that’s the next step.
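
One speculative option, sketched below, would be to recode the capped values as missing so they could be folded into whatever imputation approach we settle on. To be clear, this is a possibility I’m considering, not a decision:

### Speculative: treat values at the apparent cap of 20 as missing
capToNA <- function(x) {
  if (is.numeric(x)) x[!is.na(x) & abs(x - 20) < 0.001] <- NA
  x
}
trainingDataC[, -(1:2)] <- lapply(trainingDataC[, -(1:2)], capToNA)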