This R script has been developed in the context of the challenge “Catch the fraud”.
Goal is is to develop an algoritm to predict fraud cases.
Provided is a csv file with fraud information and variables giving information over the persons.
Data is present in file “Fraud Instance Raw Data.csv”
After loading the data some summary information of the data is shown.
The first column contains the unique identifier of the case which we will ignore.
The second column contains the fraud information (0 is no, 1 is yes).
fraud <- read.csv("Fraud Instance Raw Data.csv")
str(fraud)
## 'data.frame': 4349 obs. of 13 variables:
## $ X. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Fraud.Instance : int 0 0 0 1 0 0 1 0 1 1 ...
## $ Damaged.Item : int 1 0 1 0 1 1 0 0 1 0 ...
## $ Item.Not.Avaiable: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Item.Not.In.Stock: int 1 1 0 1 0 1 0 1 0 0 ...
## $ Product.Care.Plan: int 0 1 1 0 1 0 1 0 0 1 ...
## $ Claim.Amount : Factor w/ 321 levels "$100 ","$101 ",..: 311 191 289 252 198 274 195 230 23 147 ...
## $ Registered.Online: int 1 0 0 0 0 1 1 1 1 0 ...
## $ Age.Group : int 29 33 39 49 37 25 55 34 49 42 ...
## $ Marital.Status : Factor w/ 3 levels "In-Relationship",..: 1 2 2 1 1 1 1 2 2 1 ...
## $ Owns.a.Vehicle : int 1 1 1 1 1 1 1 1 1 0 ...
## $ Accomodation.Type: Factor w/ 3 levels "Owns a house",..: 1 3 3 2 3 1 1 2 2 1 ...
## $ Height..cms. : int 155 178 156 187 184 157 173 169 185 159 ...
summary(fraud)
## X. Fraud.Instance Damaged.Item Item.Not.Avaiable
## Min. : 1 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1088 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2175 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :2175 Mean :0.3923 Mean :0.6691 Mean :0.1398
## 3rd Qu.:3262 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4349 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Item.Not.In.Stock Product.Care.Plan Claim.Amount Registered.Online
## Min. :0.0000 Min. :0.0000 $250 : 25 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 $252 : 23 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 $177 : 22 Median :0.0000
## Mean :0.4992 Mean :0.2996 $246 : 22 Mean :0.4914
## 3rd Qu.:1.0000 3rd Qu.:1.0000 $280 : 22 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 $112 : 21 Max. :1.0000
## (Other):4214
## Age.Group Marital.Status Owns.a.Vehicle
## Min. :18.0 In-Relationship:1408 Min. :0.0000
## 1st Qu.:27.0 Married :1503 1st Qu.:0.0000
## Median :36.0 Unmarried, :1438 Median :0.0000
## Mean :36.4 Mean :0.4976
## 3rd Qu.:46.0 3rd Qu.:1.0000
## Max. :55.0 Max. :1.0000
##
## Accomodation.Type Height..cms.
## Owns a house :1441 Min. :150.0
## Rented :1409 1st Qu.:160.0
## Staying with Family:1499 Median :170.0
## Mean :170.3
## 3rd Qu.:181.0
## Max. :190.0
##
table(is.na(fraud))
##
## FALSE
## 56537
We see that there are no NA’s in our dataset so that’s nice. After reading in we see a lot of variables with values 0 and 1. We keep it that way so we can use th lm function for prediction. Furthermore we see the Claim.Amount as a factor with $-sign and we want that to be numeric
After cleaning the data we show the summary information again.
# Converting categorical columns numeric
fraud$Claim.Amount <- as.numeric(fraud$Claim.Amount)
fraud$Marital.Status <- as.numeric(fraud$Marital.Status)
fraud$Accomodation.Type <- as.numeric(fraud$Accomodation.Type)
hist(fraud$Claim.Amount)
hist(fraud$Height..cms)
str(fraud)
## 'data.frame': 4349 obs. of 13 variables:
## $ X. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Fraud.Instance : int 0 0 0 1 0 0 1 0 1 1 ...
## $ Damaged.Item : int 1 0 1 0 1 1 0 0 1 0 ...
## $ Item.Not.Avaiable: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Item.Not.In.Stock: int 1 1 0 1 0 1 0 1 0 0 ...
## $ Product.Care.Plan: int 0 1 1 0 1 0 1 0 0 1 ...
## $ Claim.Amount : num 311 191 289 252 198 274 195 230 23 147 ...
## $ Registered.Online: int 1 0 0 0 0 1 1 1 1 0 ...
## $ Age.Group : int 29 33 39 49 37 25 55 34 49 42 ...
## $ Marital.Status : num 1 2 2 1 1 1 1 2 2 1 ...
## $ Owns.a.Vehicle : int 1 1 1 1 1 1 1 1 1 0 ...
## $ Accomodation.Type: num 1 3 3 2 3 1 1 2 2 1 ...
## $ Height..cms. : int 155 178 156 187 184 157 173 169 185 159 ...
We see a roughly even distribution of claim amount and height.
We want to know what correlations are present between the different variables, to decide which one to use or not.
The first 2 columns will not be evaluated for correlation.
We will show the correlations in a nice plot.
c <- cor(fraud[3:13], method = "spearman")
mar=c(5.1, 4.1, 4.1, 2.1)
corrplot(c, method = "circle", tl.cex=0.6, tl.srt = 45, tl.col = "black", type= "lower", order="hclust")
We see that “Damaged.item” is somewhat negatiove correlated to 3 other variables and “Product.Care.Plan” to 2 other variables.
Both negative correlations are not too strong, so we will use all variables in the next steps.
Important to mention that columns Marital.status and Accomodation.type are factors with 3 different values. Would they have been ordinal, then they could have been useful in both PCA and linear model. As they are NOT using them in numerical methods is not meaningfull because there is no order in the different values (married is not greater or smaller than divorced). For factors with only 2 values (yes/no) it is feasible to use numeric methods.
set.seed(3435)
trainIndicator <- rmultinom(nrow(fraud),1,c(0.75,0.25))
fraud.train <- fraud[trainIndicator[1,] == 1,]
fraud.test <- fraud[trainIndicator[2,] == 1,]
Remove identifier (1), outcome column (2) and variables not useful in numerical methods (10&12).
fraud.train.pca <- fraud.train[,c(-1,-2,-10,-12)]
prim.comp <- prcomp(fraud.train.pca, scale. = T)
prim.comp$rotation[1:(ncol(fraud.train)-4),1:4]
## PC1 PC2 PC3 PC4
## Damaged.Item 0.45655274 0.59011301 -0.07046906 0.05349000
## Item.Not.Avaiable -0.35023282 -0.47770243 -0.20820843 -0.30512026
## Item.Not.In.Stock -0.63809903 0.29257227 0.10812398 0.13320042
## Product.Care.Plan 0.49631105 -0.53597554 0.10878798 0.05919744
## Claim.Amount -0.02699895 0.03976628 0.52441341 -0.43469283
## Registered.Online -0.02898233 0.04801239 -0.72824207 -0.02240237
## Age.Group 0.03382949 0.17697190 0.03384877 -0.44616894
## Owns.a.Vehicle -0.06982659 -0.11340061 0.04715703 0.61601259
## Height..cms. -0.08838802 -0.05131683 0.34560423 0.33881530
par(mar=c(8,4,2,1))
plot(prim.comp$rotation[,1],xaxt="n",xlab="",ylab="Rotation to primary component")
axis(1,at=1:(ncol(fraud.train)-4),labels=as.vector(rownames(prim.comp$rotation)),las=2)
title(main="Influence of variables on primary component")
We see 4 variables having significant influence on the direction of the principal component. We will use them in a normal linear model prediction for the fraud.
For this we split the total cases in a 75% train set and a 25% test set to check the model.
model <- lm(Fraud.Instance ~ Item.Not.In.Stock+ Product.Care.Plan +
Damaged.Item + Item.Not.Avaiable,data=fraud.train)
summary(model)
##
## Call:
## lm(formula = Fraud.Instance ~ Item.Not.In.Stock + Product.Care.Plan +
## Damaged.Item + Item.Not.Avaiable, data = fraud.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.175e-12 -1.000e-15 1.000e-15 2.600e-15 3.416e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.000e+00 3.413e-15 5.860e+14 <2e-16 ***
## Item.Not.In.Stock -1.000e+00 2.480e-15 -4.032e+14 <2e-16 ***
## Product.Care.Plan -1.000e+00 2.653e-15 -3.770e+14 <2e-16 ***
## Damaged.Item -1.000e+00 2.564e-15 -3.900e+14 <2e-16 ***
## Item.Not.Avaiable -1.000e+00 3.240e-15 -3.087e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.649e-14 on 3246 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.117e+28 on 4 and 3246 DF, p-value: < 2.2e-16
All p-values of the variables chosen are extremely low, which means they have great significance. Also the (absolute) t-values are huge, meaning that the chance of a false prediction is t times the standard deviation away from the mean, thus extremely small.
This all is summarized in an adjusted r-square of 1.
prediction.train <- predict(model,fraud.train)
predictedFraude <- rep(0,nrow(fraud.train))
predictedFraude[prediction.train > 0.5] <- 1
nrow(fraud.train)
## [1] 3251
p <- table(predictedFraude,fraud.train$Fraud.Instance)
p
##
## predictedFraude 0 1
## 0 1947 0
## 1 0 1304
colors=c("green","red","red4","green4")
labels <- round(p/sum(p) * 100, 1)
labels <- paste(labels,"%",sep="") # add % to labels
bp <- pie(p, main="Distribution of good/false predicted", col=colors, labels=labels , cex=1)
legend("topright", c("OK no fraud predicted","NOK fraud predicted","NOK no fraud predicted","OK fraud predicted"),
cex = 0.7,fill =colors)
bp
## NULL
We see a perfect prediction of fraud yes or no, a 100% score. Let’s hope this is not due to overfitting. We can check on that with the test set.
model2 <- lm(Fraud.Instance ~ Item.Not.In.Stock+ Product.Care.Plan +
Damaged.Item + Item.Not.Avaiable,data=fraud.test)
summary(model2)
##
## Call:
## lm(formula = Fraud.Instance ~ Item.Not.In.Stock + Product.Care.Plan +
## Damaged.Item + Item.Not.Avaiable, data = fraud.test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.467e-14 -1.710e-15 -4.400e-16 -1.100e-16 3.569e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.000e+00 1.604e-15 1.247e+15 <2e-16 ***
## Item.Not.In.Stock -1.000e+00 1.110e-15 -9.012e+14 <2e-16 ***
## Product.Care.Plan -1.000e+00 1.241e-15 -8.058e+14 <2e-16 ***
## Damaged.Item -1.000e+00 1.182e-15 -8.461e+14 <2e-16 ***
## Item.Not.Avaiable -1.000e+00 1.520e-15 -6.580e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.49e-14 on 1093 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.869e+29 on 4 and 1093 DF, p-value: < 2.2e-16
prediction.test <- predict(model2,fraud.test)
predictedFraude <- rep(0,nrow(fraud.test))
predictedFraude[model2$fitted > 0.5] <- 1
nrow(fraud.test)
## [1] 1098
p <- table(predictedFraude,fraud.test$Fraud.Instance)
p
##
## predictedFraude 0 1
## 0 696 0
## 1 0 402
colors=c("green","red","red4","green4")
labels <- round(p/sum(p) * 100, 1)
labels <- paste(labels,"%",sep="") # add % to labels
bp <- pie(p, main="Distribution of good/false predicted", col=colors, labels=labels , cex=1)
legend("topright", c("OK no fraud predicted","NOK fraud predicted","NOK no fraud predicted","OK fraud predicted"),
cex = 0.7,fill =colors)
bp
## NULL
We are lucky !! On the test set the same score of 100% right prediction shows up.
So a good way to predict fraud with this dataset is using the knowledge from four variables:
1 Item.Not.In.Stock
2 Product.Care.Plan
3 Damaged.Item
4 Item.Not.Avaiable