Preface

This R script has been developed in the context of the challenge “Catch the fraud”.
Goal is is to develop an algoritm to predict fraud cases.
Provided is a csv file with fraud information and variables giving information over the persons.
Data is present in file “Fraud Instance Raw Data.csv”

Load the data

After loading the data some summary information of the data is shown.
The first column contains the unique identifier of the case which we will ignore.
The second column contains the fraud information (0 is no, 1 is yes).

fraud <- read.csv("Fraud Instance Raw Data.csv")
str(fraud)
## 'data.frame':    4349 obs. of  13 variables:
##  $ X.               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Fraud.Instance   : int  0 0 0 1 0 0 1 0 1 1 ...
##  $ Damaged.Item     : int  1 0 1 0 1 1 0 0 1 0 ...
##  $ Item.Not.Avaiable: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Item.Not.In.Stock: int  1 1 0 1 0 1 0 1 0 0 ...
##  $ Product.Care.Plan: int  0 1 1 0 1 0 1 0 0 1 ...
##  $ Claim.Amount     : Factor w/ 321 levels "$100 ","$101 ",..: 311 191 289 252 198 274 195 230 23 147 ...
##  $ Registered.Online: int  1 0 0 0 0 1 1 1 1 0 ...
##  $ Age.Group        : int  29 33 39 49 37 25 55 34 49 42 ...
##  $ Marital.Status   : Factor w/ 3 levels "In-Relationship",..: 1 2 2 1 1 1 1 2 2 1 ...
##  $ Owns.a.Vehicle   : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ Accomodation.Type: Factor w/ 3 levels "Owns a house",..: 1 3 3 2 3 1 1 2 2 1 ...
##  $ Height..cms.     : int  155 178 156 187 184 157 173 169 185 159 ...
summary(fraud)
##        X.       Fraud.Instance    Damaged.Item    Item.Not.Avaiable
##  Min.   :   1   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:1088   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   
##  Median :2175   Median :0.0000   Median :1.0000   Median :0.0000   
##  Mean   :2175   Mean   :0.3923   Mean   :0.6691   Mean   :0.1398   
##  3rd Qu.:3262   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :4349   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##                                                                    
##  Item.Not.In.Stock Product.Care.Plan  Claim.Amount  Registered.Online
##  Min.   :0.0000    Min.   :0.0000    $250   :  25   Min.   :0.0000   
##  1st Qu.:0.0000    1st Qu.:0.0000    $252   :  23   1st Qu.:0.0000   
##  Median :0.0000    Median :0.0000    $177   :  22   Median :0.0000   
##  Mean   :0.4992    Mean   :0.2996    $246   :  22   Mean   :0.4914   
##  3rd Qu.:1.0000    3rd Qu.:1.0000    $280   :  22   3rd Qu.:1.0000   
##  Max.   :1.0000    Max.   :1.0000    $112   :  21   Max.   :1.0000   
##                                      (Other):4214                    
##    Age.Group            Marital.Status Owns.a.Vehicle  
##  Min.   :18.0   In-Relationship:1408   Min.   :0.0000  
##  1st Qu.:27.0   Married        :1503   1st Qu.:0.0000  
##  Median :36.0   Unmarried,     :1438   Median :0.0000  
##  Mean   :36.4                          Mean   :0.4976  
##  3rd Qu.:46.0                          3rd Qu.:1.0000  
##  Max.   :55.0                          Max.   :1.0000  
##                                                        
##            Accomodation.Type  Height..cms.  
##  Owns a house       :1441    Min.   :150.0  
##  Rented             :1409    1st Qu.:160.0  
##  Staying with Family:1499    Median :170.0  
##                              Mean   :170.3  
##                              3rd Qu.:181.0  
##                              Max.   :190.0  
## 
table(is.na(fraud))
## 
## FALSE 
## 56537

Is the data tidy?

We see that there are no NA’s in our dataset so that’s nice. After reading in we see a lot of variables with values 0 and 1. We keep it that way so we can use th lm function for prediction. Furthermore we see the Claim.Amount as a factor with $-sign and we want that to be numeric

Clean the data

After cleaning the data we show the summary information again.

# Converting categorical columns numeric
fraud$Claim.Amount <- as.numeric(fraud$Claim.Amount)
fraud$Marital.Status <- as.numeric(fraud$Marital.Status)
fraud$Accomodation.Type <- as.numeric(fraud$Accomodation.Type)
hist(fraud$Claim.Amount)

hist(fraud$Height..cms)

str(fraud)
## 'data.frame':    4349 obs. of  13 variables:
##  $ X.               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Fraud.Instance   : int  0 0 0 1 0 0 1 0 1 1 ...
##  $ Damaged.Item     : int  1 0 1 0 1 1 0 0 1 0 ...
##  $ Item.Not.Avaiable: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Item.Not.In.Stock: int  1 1 0 1 0 1 0 1 0 0 ...
##  $ Product.Care.Plan: int  0 1 1 0 1 0 1 0 0 1 ...
##  $ Claim.Amount     : num  311 191 289 252 198 274 195 230 23 147 ...
##  $ Registered.Online: int  1 0 0 0 0 1 1 1 1 0 ...
##  $ Age.Group        : int  29 33 39 49 37 25 55 34 49 42 ...
##  $ Marital.Status   : num  1 2 2 1 1 1 1 2 2 1 ...
##  $ Owns.a.Vehicle   : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ Accomodation.Type: num  1 3 3 2 3 1 1 2 2 1 ...
##  $ Height..cms.     : int  155 178 156 187 184 157 173 169 185 159 ...

We see a roughly even distribution of claim amount and height.

Correlations

We want to know what correlations are present between the different variables, to decide which one to use or not.
The first 2 columns will not be evaluated for correlation.
We will show the correlations in a nice plot.

c <- cor(fraud[3:13], method = "spearman")
mar=c(5.1, 4.1, 4.1, 2.1)
corrplot(c, method = "circle", tl.cex=0.6, tl.srt = 45, tl.col = "black", type= "lower", order="hclust")

We see that “Damaged.item” is somewhat negatiove correlated to 3 other variables and “Product.Care.Plan” to 2 other variables.
Both negative correlations are not too strong, so we will use all variables in the next steps.

Important to mention that columns Marital.status and Accomodation.type are factors with 3 different values. Would they have been ordinal, then they could have been useful in both PCA and linear model. As they are NOT using them in numerical methods is not meaningfull because there is no order in the different values (married is not greater or smaller than divorced). For factors with only 2 values (yes/no) it is feasible to use numeric methods.

Create train and test datasets (75/25) for the 4349 observations

set.seed(3435)

trainIndicator <- rmultinom(nrow(fraud),1,c(0.75,0.25))
fraud.train <- fraud[trainIndicator[1,] == 1,]
fraud.test <- fraud[trainIndicator[2,] == 1,]

Prepare data for Principal components analysis

Remove identifier (1), outcome column (2) and variables not useful in numerical methods (10&12).

fraud.train.pca <- fraud.train[,c(-1,-2,-10,-12)]
prim.comp <- prcomp(fraud.train.pca, scale. = T)
prim.comp$rotation[1:(ncol(fraud.train)-4),1:4]
##                           PC1         PC2         PC3         PC4
## Damaged.Item       0.45655274  0.59011301 -0.07046906  0.05349000
## Item.Not.Avaiable -0.35023282 -0.47770243 -0.20820843 -0.30512026
## Item.Not.In.Stock -0.63809903  0.29257227  0.10812398  0.13320042
## Product.Care.Plan  0.49631105 -0.53597554  0.10878798  0.05919744
## Claim.Amount      -0.02699895  0.03976628  0.52441341 -0.43469283
## Registered.Online -0.02898233  0.04801239 -0.72824207 -0.02240237
## Age.Group          0.03382949  0.17697190  0.03384877 -0.44616894
## Owns.a.Vehicle    -0.06982659 -0.11340061  0.04715703  0.61601259
## Height..cms.      -0.08838802 -0.05131683  0.34560423  0.33881530
par(mar=c(8,4,2,1))
plot(prim.comp$rotation[,1],xaxt="n",xlab="",ylab="Rotation to primary component")
axis(1,at=1:(ncol(fraud.train)-4),labels=as.vector(rownames(prim.comp$rotation)),las=2)
title(main="Influence of variables on primary component")

We see 4 variables having significant influence on the direction of the principal component. We will use them in a normal linear model prediction for the fraud.
For this we split the total cases in a 75% train set and a 25% test set to check the model.

model <- lm(Fraud.Instance ~ Item.Not.In.Stock+ Product.Care.Plan +
                             Damaged.Item + Item.Not.Avaiable,data=fraud.train)
summary(model)
## 
## Call:
## lm(formula = Fraud.Instance ~ Item.Not.In.Stock + Product.Care.Plan + 
##     Damaged.Item + Item.Not.Avaiable, data = fraud.train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.175e-12 -1.000e-15  1.000e-15  2.600e-15  3.416e-13 
## 
## Coefficients:
##                     Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)        2.000e+00  3.413e-15  5.860e+14   <2e-16 ***
## Item.Not.In.Stock -1.000e+00  2.480e-15 -4.032e+14   <2e-16 ***
## Product.Care.Plan -1.000e+00  2.653e-15 -3.770e+14   <2e-16 ***
## Damaged.Item      -1.000e+00  2.564e-15 -3.900e+14   <2e-16 ***
## Item.Not.Avaiable -1.000e+00  3.240e-15 -3.087e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.649e-14 on 3246 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.117e+28 on 4 and 3246 DF,  p-value: < 2.2e-16

All p-values of the variables chosen are extremely low, which means they have great significance. Also the (absolute) t-values are huge, meaning that the chance of a false prediction is t times the standard deviation away from the mean, thus extremely small.
This all is summarized in an adjusted r-square of 1.

prediction.train <- predict(model,fraud.train)
predictedFraude <- rep(0,nrow(fraud.train))
predictedFraude[prediction.train > 0.5] <- 1
nrow(fraud.train)
## [1] 3251
p <- table(predictedFraude,fraud.train$Fraud.Instance)
p
##                
## predictedFraude    0    1
##               0 1947    0
##               1    0 1304
colors=c("green","red","red4","green4")
labels <- round(p/sum(p) * 100, 1)
labels <- paste(labels,"%",sep="") # add % to labels 

bp <- pie(p, main="Distribution of good/false predicted", col=colors, labels=labels , cex=1)
legend("topright", c("OK no fraud predicted","NOK fraud predicted","NOK no fraud predicted","OK fraud predicted"), 
       cex = 0.7,fill =colors)

bp
## NULL

We see a perfect prediction of fraud yes or no, a 100% score. Let’s hope this is not due to overfitting. We can check on that with the test set.

Test results

model2 <- lm(Fraud.Instance ~ Item.Not.In.Stock+ Product.Care.Plan +
                              Damaged.Item + Item.Not.Avaiable,data=fraud.test)
summary(model2)
## 
## Call:
## lm(formula = Fraud.Instance ~ Item.Not.In.Stock + Product.Care.Plan + 
##     Damaged.Item + Item.Not.Avaiable, data = fraud.test)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.467e-14 -1.710e-15 -4.400e-16 -1.100e-16  3.569e-13 
## 
## Coefficients:
##                     Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)        2.000e+00  1.604e-15  1.247e+15   <2e-16 ***
## Item.Not.In.Stock -1.000e+00  1.110e-15 -9.012e+14   <2e-16 ***
## Product.Care.Plan -1.000e+00  1.241e-15 -8.058e+14   <2e-16 ***
## Damaged.Item      -1.000e+00  1.182e-15 -8.461e+14   <2e-16 ***
## Item.Not.Avaiable -1.000e+00  1.520e-15 -6.580e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.49e-14 on 1093 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.869e+29 on 4 and 1093 DF,  p-value: < 2.2e-16
prediction.test <- predict(model2,fraud.test)
predictedFraude <- rep(0,nrow(fraud.test))
predictedFraude[model2$fitted > 0.5] <- 1
nrow(fraud.test)
## [1] 1098
p <- table(predictedFraude,fraud.test$Fraud.Instance)
p
##                
## predictedFraude   0   1
##               0 696   0
##               1   0 402
colors=c("green","red","red4","green4")
labels <- round(p/sum(p) * 100, 1)
labels <- paste(labels,"%",sep="") # add % to labels 

bp <- pie(p, main="Distribution of good/false predicted", col=colors, labels=labels , cex=1)
legend("topright", c("OK no fraud predicted","NOK fraud predicted","NOK no fraud predicted","OK fraud predicted"), 
       cex = 0.7,fill =colors)

bp
## NULL

We are lucky !! On the test set the same score of 100% right prediction shows up.

So a good way to predict fraud with this dataset is using the knowledge from four variables:

1 Item.Not.In.Stock
2 Product.Care.Plan
3 Damaged.Item
4 Item.Not.Avaiable