library(ggplot2)
library(plyr)
library(dummies)
library(logistf)
library(rattle)
Load the dataset and display a summary.
df <- read.csv("C:\\Users\\Michael\\Documents\\Data Sources\\DataScienceChallenges\\FraudInstance.csv")
df <- subset(df, select=-c(X.))
summary(df)
## Fraud.Instance Damaged.Item Item.Not.Avaiable Item.Not.In.Stock
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.3923 Mean :0.6691 Mean :0.1398 Mean :0.4992
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Product.Care.Plan Claim.Amount Registered.Online Age.Group
## Min. :0.0000 $250 : 25 Min. :0.0000 Min. :18.0
## 1st Qu.:0.0000 $252 : 23 1st Qu.:0.0000 1st Qu.:27.0
## Median :0.0000 $177 : 22 Median :0.0000 Median :36.0
## Mean :0.2996 $246 : 22 Mean :0.4914 Mean :36.4
## 3rd Qu.:1.0000 $280 : 22 3rd Qu.:1.0000 3rd Qu.:46.0
## Max. :1.0000 $112 : 21 Max. :1.0000 Max. :55.0
## (Other):4214
## Marital.Status Owns.a.Vehicle Accomodation.Type
## In-Relationship:1408 Min. :0.0000 Owns a house :1441
## Married :1503 1st Qu.:0.0000 Rented :1409
## Unmarried, :1438 Median :0.0000 Staying with Family:1499
## Mean :0.4976
## 3rd Qu.:1.0000
## Max. :1.0000
##
## Height..cms.
## Min. :150.0
## 1st Qu.:160.0
## Median :170.0
## Mean :170.3
## 3rd Qu.:181.0
## Max. :190.0
##
Normalize Height, Claim.Amount, and Age.Group and get dummy variables for Marital.Status and Accomodation.Type.
df$Claim.Amount <- as.numeric(substring(df$Claim.Amount, 2))
df <- rename(df, c("Height..cms."="Height"))
df$Claim.Amount <- scale(df$Claim.Amount)
df$Height <- scale(df$Height)
df$Age.Group <- scale(df$Age.Group)
d <- dummy.data.frame(df)
Prepare a training (80% of observations) and a test (20%) set.
set.seed(777)
smpl_size <- floor(0.8 * nrow(d))
train_ind <- sample(seq_len(nrow(d)), size = smpl_size)
train <- df[train_ind, ]
test <- df[-train_ind, ]
Using regular logistic regression (glm), the outcome Fraud.Instance displayed high prevalence which is typically the result when for some combination of the predictors, all the observations have the same event status. In other words, the (continuous) covariates predict the outcome too perfectly. This phenomenon is known as “separation”. One approach to handling this type of problem is to use a penalized likelihood method as proposed by Firth (https://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf) which is available in the logistf package
lr2 <- logistf(Fraud.Instance ~ . -Height, data=train)
summary(lr2)
## logistf(formula = Fraud.Instance ~ . - Height, data = train)
##
## Model fitted by Penalized ML
## Confidence intervals and p-values by Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood Profile Likelihood
##
## coef se(coef) lower 0.95
## (Intercept) 19.562659621 1.8308316 15.632608
## Damaged.Item -13.089230019 1.1299026 -23.149654
## Item.Not.Avaiable -11.642480790 0.9865918 -23.833563
## Item.Not.In.Stock -13.303525399 1.1481160 -22.660583
## Product.Care.Plan -12.787771799 1.0971014 -22.332259
## Claim.Amount 0.034032512 0.3911983 -5.136768
## Registered.Online -0.034256288 0.7882868 -5.195757
## Age.Group -0.001688132 0.3925457 -5.421669
## Marital.StatusMarried 0.002495061 0.9645684 -5.607588
## Marital.StatusUnmarried, 0.011638695 0.9724625 -5.651749
## Owns.a.Vehicle -0.027099770 0.7898098 -5.208164
## Accomodation.TypeRented -0.042417922 0.9816267 -5.794113
## Accomodation.TypeStaying with Family -0.011838261 0.9561503 -5.531582
## upper 0.95 Chisq p
## (Intercept) 31.351060 Inf 0.0000000
## Damaged.Item -10.608883 Inf 0.0000000
## Item.Not.Avaiable -9.708063 Inf 0.0000000
## Item.Not.In.Stock -10.784106 Inf 0.0000000
## Product.Care.Plan -10.475698 Inf 0.0000000
## Claim.Amount 5.379394 9.881027e-04 0.9749233
## Registered.Online 5.126631 3.300075e-04 0.9855063
## Age.Group 5.392377 0.000000e+00 1.0000000
## Marital.StatusMarried 5.642297 0.000000e+00 1.0000000
## Marital.StatusUnmarried, 5.686372 2.545574e-05 0.9959744
## Owns.a.Vehicle 5.155865 1.925321e-04 0.9889292
## Accomodation.TypeRented 5.710139 3.046743e-04 0.9860737
## Accomodation.TypeStaying with Family 5.559199 5.561901e-05 0.9940496
##
## Likelihood ratio test=4584.737 on 12 df, p=0, n=3479
## Wald test = 222.1264 on 12 df, p = 0
##
## Covariance-Matrix:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3.351944472 -1.525338973 -0.9627154244 -1.471078906 -1.187473150
## [2,] -1.525338973 1.276679867 0.5379148829 0.848115561 0.598466897
## [3,] -0.962715424 0.537914883 0.9733633264 0.453298774 0.419537494
## [4,] -1.471078906 0.848115561 0.4532987745 1.318170264 0.735680107
## [5,] -1.187473150 0.598466897 0.4195374944 0.735680107 1.203631405
## [6,] 0.003399960 -0.004406942 -0.0100549659 0.003249383 -0.015146548
## [7,] -0.300111739 -0.004526959 -0.0149915999 -0.003496878 0.010929037
## [8,] -0.006124246 -0.011279808 0.0060333077 -0.003654881 0.000509647
## [9,] -0.491606684 0.013888958 -0.0002666287 0.015301253 -0.007365195
## [10,] -0.472794560 0.028441166 0.0124254938 -0.001235783 0.002810468
## [11,] -0.332467999 0.022142678 -0.0025588126 0.010596598 0.032022204
## [12,] -0.500109514 0.011118518 0.0155436496 0.004509845 -0.006900523
## [13,] -0.463610140 0.007012914 0.0025457710 -0.003365407 -0.022335547
## [,6] [,7] [,8] [,9]
## [1,] 0.0033999604 -0.3001117389 -0.0061242464 -0.4916066845
## [2,] -0.0044069419 -0.0045269593 -0.0112798083 0.0138889579
## [3,] -0.0100549659 -0.0149915999 0.0060333077 -0.0002666287
## [4,] 0.0032493826 -0.0034968777 -0.0036548814 0.0153012532
## [5,] -0.0151465476 0.0109290374 0.0005096470 -0.0073651950
## [6,] 0.1530361474 -0.0018401958 0.0008845845 0.0038434792
## [7,] -0.0018401958 0.6213961444 -0.0024262655 -0.0059929426
## [8,] 0.0008845845 -0.0024262655 0.1540921422 0.0060181045
## [9,] 0.0038434792 -0.0059929426 0.0060181045 0.9303921508
## [10,] -0.0001504503 -0.0054270936 0.0098540057 0.4742820357
## [11,] -0.0061586481 -0.0006597107 0.0057645687 -0.0020933397
## [12,] 0.0067433683 -0.0025322422 0.0028830865 0.0322737461
## [13,] 0.0050144031 0.0089898857 0.0153159128 -0.0066742030
## [,10] [,11] [,12] [,13]
## [1,] -0.4727945600 -0.3324679993 -0.500109514 -0.463610140
## [2,] 0.0284411658 0.0221426782 0.011118518 0.007012914
## [3,] 0.0124254938 -0.0025588126 0.015543650 0.002545771
## [4,] -0.0012357826 0.0105965982 0.004509845 -0.003365407
## [5,] 0.0028104685 0.0320222044 -0.006900523 -0.022335547
## [6,] -0.0001504503 -0.0061586481 0.006743368 0.005014403
## [7,] -0.0054270936 -0.0006597107 -0.002532242 0.008989886
## [8,] 0.0098540057 0.0057645687 0.002883087 0.015315913
## [9,] 0.4742820357 -0.0020933397 0.032273746 -0.006674203
## [10,] 0.9456833237 -0.0232303076 -0.009881895 -0.016487067
## [11,] -0.0232303076 0.6237995117 0.015715154 -0.008013606
## [12,] -0.0098818952 0.0157151540 0.963590978 0.473486299
## [13,] -0.0164870674 -0.0080136057 0.473486299 0.914223349
betas <- coef(lr2)
X <- model.matrix(lr2, data=test)
probs <- 1 / (1 + exp(-X %*% betas))
#table(test$Fraud.Instance, probs>0.5)
accuracy <- sum(test$Fraud.Instance == (probs>0.5)) / nrow(test)
accuracy
## [1] 1
The results show that the predictors Damaged.Item, Item.Not.Avaiable, Item.Not.In.Stock, and Product.Care.Plan received coefficients (weights) that were significantly different from zero such that they had some influence on the final value of Fraud.Instance. The perfect prediction is most likely the result of this dataset being artificially generated.