Catch The Fraudster

library(ggplot2)
library(plyr)
library(dummies)
library(logistf)
library(rattle)

Load the dataset and display a summary.

df <- read.csv("C:\\Users\\Michael\\Documents\\Data Sources\\DataScienceChallenges\\FraudInstance.csv")
df <- subset(df, select=-c(X.))
summary(df)

##  Fraud.Instance    Damaged.Item    Item.Not.Avaiable Item.Not.In.Stock
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000    Min.   :0.0000   
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000    1st Qu.:0.0000   
##  Median :0.0000   Median :1.0000   Median :0.0000    Median :0.0000   
##  Mean   :0.3923   Mean   :0.6691   Mean   :0.1398    Mean   :0.4992   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000    3rd Qu.:1.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000    Max.   :1.0000   
##                                                                       
##  Product.Care.Plan  Claim.Amount  Registered.Online   Age.Group   
##  Min.   :0.0000    $250   :  25   Min.   :0.0000    Min.   :18.0  
##  1st Qu.:0.0000    $252   :  23   1st Qu.:0.0000    1st Qu.:27.0  
##  Median :0.0000    $177   :  22   Median :0.0000    Median :36.0  
##  Mean   :0.2996    $246   :  22   Mean   :0.4914    Mean   :36.4  
##  3rd Qu.:1.0000    $280   :  22   3rd Qu.:1.0000    3rd Qu.:46.0  
##  Max.   :1.0000    $112   :  21   Max.   :1.0000    Max.   :55.0  
##                    (Other):4214                                   
##          Marital.Status Owns.a.Vehicle             Accomodation.Type
##  In-Relationship:1408   Min.   :0.0000   Owns a house       :1441   
##  Married        :1503   1st Qu.:0.0000   Rented             :1409   
##  Unmarried,     :1438   Median :0.0000   Staying with Family:1499   
##                         Mean   :0.4976                              
##                         3rd Qu.:1.0000                              
##                         Max.   :1.0000                              
##                                                                     
##   Height..cms.  
##  Min.   :150.0  
##  1st Qu.:160.0  
##  Median :170.0  
##  Mean   :170.3  
##  3rd Qu.:181.0  
##  Max.   :190.0  
##

Normalize Height, Claim.Amount, and Age.Group and get dummy variables for Marital.Status and Accomodation.Type.

df$Claim.Amount <- as.numeric(substring(df$Claim.Amount, 2))
df <- rename(df, c("Height..cms."="Height"))
df$Claim.Amount <- scale(df$Claim.Amount)
df$Height <- scale(df$Height)
df$Age.Group <- scale(df$Age.Group)
d <- dummy.data.frame(df)

Prepare a training (80% of observations) and a test (20%) set.

set.seed(777)
smpl_size <- floor(0.8 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smpl_size)
train <- df[train_ind, ]
test <- df[-train_ind, ]

Using regular logistic regression (glm), the outcome Fraud.Instance displayed high prevalence which is typically the result when for some combination of the predictors, all the observations have the same event status. In other words, the (continuous) covariates predict the outcome too perfectly. This phenomenon is known as “separation”. One approach to handling this type of problem is to use a penalized likelihood method as proposed by Firth (https://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf) which is available in the logistf package

lr2 <- logistf(Fraud.Instance ~ . -Height, data=train)

summary(lr2)

## logistf(formula = Fraud.Instance ~ . - Height, data = train)
## 
## Model fitted by Penalized ML
## Confidence intervals and p-values by Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood 
## 
##                                               coef  se(coef) lower 0.95
## (Intercept)                           19.562659621 1.8308316  15.632608
## Damaged.Item                         -13.089230019 1.1299026 -23.149654
## Item.Not.Avaiable                    -11.642480790 0.9865918 -23.833563
## Item.Not.In.Stock                    -13.303525399 1.1481160 -22.660583
## Product.Care.Plan                    -12.787771799 1.0971014 -22.332259
## Claim.Amount                           0.034032512 0.3911983  -5.136768
## Registered.Online                     -0.034256288 0.7882868  -5.195757
## Age.Group                             -0.001688132 0.3925457  -5.421669
## Marital.StatusMarried                  0.002495061 0.9645684  -5.607588
## Marital.StatusUnmarried,               0.011638695 0.9724625  -5.651749
## Owns.a.Vehicle                        -0.027099770 0.7898098  -5.208164
## Accomodation.TypeRented               -0.042417922 0.9816267  -5.794113
## Accomodation.TypeStaying with Family  -0.011838261 0.9561503  -5.531582
##                                      upper 0.95        Chisq         p
## (Intercept)                           31.351060          Inf 0.0000000
## Damaged.Item                         -10.608883          Inf 0.0000000
## Item.Not.Avaiable                     -9.708063          Inf 0.0000000
## Item.Not.In.Stock                    -10.784106          Inf 0.0000000
## Product.Care.Plan                    -10.475698          Inf 0.0000000
## Claim.Amount                           5.379394 9.881027e-04 0.9749233
## Registered.Online                      5.126631 3.300075e-04 0.9855063
## Age.Group                              5.392377 0.000000e+00 1.0000000
## Marital.StatusMarried                  5.642297 0.000000e+00 1.0000000
## Marital.StatusUnmarried,               5.686372 2.545574e-05 0.9959744
## Owns.a.Vehicle                         5.155865 1.925321e-04 0.9889292
## Accomodation.TypeRented                5.710139 3.046743e-04 0.9860737
## Accomodation.TypeStaying with Family   5.559199 5.561901e-05 0.9940496
## 
## Likelihood ratio test=4584.737 on 12 df, p=0, n=3479
## Wald test = 222.1264 on 12 df, p = 0
## 
## Covariance-Matrix:
##               [,1]         [,2]          [,3]         [,4]         [,5]
##  [1,]  3.351944472 -1.525338973 -0.9627154244 -1.471078906 -1.187473150
##  [2,] -1.525338973  1.276679867  0.5379148829  0.848115561  0.598466897
##  [3,] -0.962715424  0.537914883  0.9733633264  0.453298774  0.419537494
##  [4,] -1.471078906  0.848115561  0.4532987745  1.318170264  0.735680107
##  [5,] -1.187473150  0.598466897  0.4195374944  0.735680107  1.203631405
##  [6,]  0.003399960 -0.004406942 -0.0100549659  0.003249383 -0.015146548
##  [7,] -0.300111739 -0.004526959 -0.0149915999 -0.003496878  0.010929037
##  [8,] -0.006124246 -0.011279808  0.0060333077 -0.003654881  0.000509647
##  [9,] -0.491606684  0.013888958 -0.0002666287  0.015301253 -0.007365195
## [10,] -0.472794560  0.028441166  0.0124254938 -0.001235783  0.002810468
## [11,] -0.332467999  0.022142678 -0.0025588126  0.010596598  0.032022204
## [12,] -0.500109514  0.011118518  0.0155436496  0.004509845 -0.006900523
## [13,] -0.463610140  0.007012914  0.0025457710 -0.003365407 -0.022335547
##                [,6]          [,7]          [,8]          [,9]
##  [1,]  0.0033999604 -0.3001117389 -0.0061242464 -0.4916066845
##  [2,] -0.0044069419 -0.0045269593 -0.0112798083  0.0138889579
##  [3,] -0.0100549659 -0.0149915999  0.0060333077 -0.0002666287
##  [4,]  0.0032493826 -0.0034968777 -0.0036548814  0.0153012532
##  [5,] -0.0151465476  0.0109290374  0.0005096470 -0.0073651950
##  [6,]  0.1530361474 -0.0018401958  0.0008845845  0.0038434792
##  [7,] -0.0018401958  0.6213961444 -0.0024262655 -0.0059929426
##  [8,]  0.0008845845 -0.0024262655  0.1540921422  0.0060181045
##  [9,]  0.0038434792 -0.0059929426  0.0060181045  0.9303921508
## [10,] -0.0001504503 -0.0054270936  0.0098540057  0.4742820357
## [11,] -0.0061586481 -0.0006597107  0.0057645687 -0.0020933397
## [12,]  0.0067433683 -0.0025322422  0.0028830865  0.0322737461
## [13,]  0.0050144031  0.0089898857  0.0153159128 -0.0066742030
##               [,10]         [,11]        [,12]        [,13]
##  [1,] -0.4727945600 -0.3324679993 -0.500109514 -0.463610140
##  [2,]  0.0284411658  0.0221426782  0.011118518  0.007012914
##  [3,]  0.0124254938 -0.0025588126  0.015543650  0.002545771
##  [4,] -0.0012357826  0.0105965982  0.004509845 -0.003365407
##  [5,]  0.0028104685  0.0320222044 -0.006900523 -0.022335547
##  [6,] -0.0001504503 -0.0061586481  0.006743368  0.005014403
##  [7,] -0.0054270936 -0.0006597107 -0.002532242  0.008989886
##  [8,]  0.0098540057  0.0057645687  0.002883087  0.015315913
##  [9,]  0.4742820357 -0.0020933397  0.032273746 -0.006674203
## [10,]  0.9456833237 -0.0232303076 -0.009881895 -0.016487067
## [11,] -0.0232303076  0.6237995117  0.015715154 -0.008013606
## [12,] -0.0098818952  0.0157151540  0.963590978  0.473486299
## [13,] -0.0164870674 -0.0080136057  0.473486299  0.914223349

betas <- coef(lr2)
X <- model.matrix(lr2, data=test)
probs <- 1 / (1 + exp(-X %*% betas))
#table(test$Fraud.Instance, probs>0.5)
accuracy <- sum(test$Fraud.Instance == (probs>0.5)) / nrow(test)
accuracy

## [1] 1

The results show that the predictors Damaged.Item, Item.Not.Avaiable, Item.Not.In.Stock, and Product.Care.Plan received coefficients (weights) that were significantly different from zero such that they had some influence on the final value of Fraud.Instance. The perfect prediction is most likely the result of this dataset being artificially generated.

Catch The Fraudster

Michael Boerrigter

June 24, 2017