This is a demo in R of predicting credit card fraud in a dataset. We will work through the roadmap outlined below, which includes several steps. Of note, we are using a publicly available, anonymized dataset, “creditcard.csv.”
The predictor variables have already gone through PCA (Principal Component Analysis), and we will discuss how that impacts the analysis. The target (dependent) variable is labeled “Class” and flags fraud; it is already coded in the dataset as ‘0’ (legitimate) or ‘1’ (fraud). We will also discuss how this would be deployed in an enterprise environment.
library(dplyr)        # data manipulation
library(stringr)      # string handling
library(caret)        # sampling, train/test utilities, confusion matrices
library(caTools)      # train/test split
library(ggplot2)      # data visualization
library(corrplot)     # correlation plots
library(Rtsne)        # t-SNE plotting
library(rpart)        # decision tree model
library(Rborist)      # random forest model
library(xgboost)      # xgboost model
library(C50)          # C5.0 decision trees
library(ROSE)         # sampling methods for imbalanced data
library(repr)         # plot sizing options
library(RColorBrewer) # color palettes
library(e1071)        # SVM
library(factoextra)   # cluster visualization
library(ROCR)         # ROC curves
library(pROC)         # ROC / AUC
library(nnet)         # neural networks
library(tidyr)        # data tidying
library(lattice)      # plotting
df <- read.csv("/cloud/project/ccard.csv", header=TRUE, stringsAsFactors=FALSE)
head(df,5)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## V13 V14 V15 V16 V17 V18 V19
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.2079712 0.02579058 0.4039930
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.1148047 -0.18336127 -0.1457830
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.1099694 -0.12135931 -2.2618571
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.6840928 1.96577500 -1.2326220
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.2370332 -0.03819479 0.8034869
## V20 V21 V22 V23 V24 V25
## 1 0.25141210 -0.018306778 0.277837576 -0.1104739 0.06692807 0.1285394
## 2 -0.06908314 -0.225775248 -0.638671953 0.1012880 -0.33984648 0.1671704
## 3 0.52497973 0.247998153 0.771679402 0.9094123 -0.68928096 -0.3276418
## 4 -0.20803778 -0.108300452 0.005273597 -0.1903205 -1.17557533 0.6473760
## 5 0.40854236 -0.009430697 0.798278495 -0.1374581 0.14126698 -0.2060096
## V26 V27 V28 Amount Class
## 1 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 0.5022922 0.219422230 0.21515315 69.99 0
Because the predictor variables V1 through V28 are PCA components, they are already centered and expressed on a common scale, which is why their means are all “0.” PCA components are also constructed to be orthogonal, so when we run our correlation on the full dataset they will be largely uncorrelated with one another. The only variables not yet scaled are Amount and Time, which we will scale later.
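As a quick sanity check (a minimal sketch, assuming df has been loaded as above), we can confirm that the PCA components are centered and nearly uncorrelated:
# The PCA components V1-V28 should have means of ~0 and only tiny pairwise correlations
pca_cols <- paste0("V", 1:28)            # names of the 28 PCA components
round(colMeans(df[, pca_cols]), 3)       # all approximately 0
cor_v <- cor(df[, pca_cols])             # pairwise correlation matrix
max(abs(cor_v[upper.tri(cor_v)]))        # largest off-diagonal correlation is near 0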
str(df)
## 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
summary(df)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139320 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
sum(is.na(df))
## [1] 0
df$Class<-as.factor(df$Class)
prop.table(table(df$Class))
##
## 0 1
## 0.998272514 0.001727486
counts <- table(df$Class)
barplot(counts, main="Fraud Distribution",
xlab="Counts of Fraud", col="purple")
From this we can see the very uneven representation of credit card fraud in the data, which is typical for this problem. It means that predictions will always be inclined towards the majority class.
We will have to address this. With such an imbalance you might see a high accuracy number, but we should not be fooled by it, because the sensitivity (the rate at which actual fraud is caught) will be low. Therefore, we have to address the imbalance in the dataset before going further.
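To make the point concrete, here is a minimal sketch of a trivial baseline that always predicts “not fraud” (the object baseline_pred is introduced only for this illustration): its accuracy is about 99.8%, yet its sensitivity for the fraud class is exactly 0.
# A do-nothing baseline: predict the majority class for every transaction
baseline_pred <- factor(rep("0", nrow(df)), levels = levels(df$Class))
confusionMatrix(baseline_pred, df$Class, positive = "1")  # high accuracy, zero fraud caught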
hist(df$Amount, col = 'purple', border = "white")
hist(df$Time, col = 'purple', border = "white")
a <- colnames(df)[2:29]
par(mfrow = c(1, 4))
for (i in 1:length(a)) {
  sub <- df[a[i]][, 1]
  hist(sub, main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
}
Now we are going to scale the variables Time and Amount so they are on the same footing as the other variables. We use the z-score formula, which centers each variable at 0 and standardizes it to unit variance.
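For reference, a small sketch of what scale() does to a single column, using Amount before it is transformed:
# z-score by hand: subtract the mean, divide by the standard deviation
z_amount <- (df$Amount - mean(df$Amount)) / sd(df$Amount)
all.equal(z_amount, as.numeric(scale(df$Amount)))  # TRUE: identical to scale()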
df[1 : 30] <- as.data.frame(scale(df[1 : 30]))
head(df,5)
## Time V1 V2 V3 V4 V5
## 1 -1.996580 -0.6942411 -0.04407485 1.6727706 0.9733638 -0.245116153
## 2 -1.996580 0.6084953 0.16117564 0.1097969 0.3165224 0.043483276
## 3 -1.996558 -0.6934992 -0.81157640 1.1694664 0.2682308 -0.364571146
## 4 -1.996558 -0.4933240 -0.11216923 1.1825144 -0.6097256 -0.007468867
## 5 -1.996537 -0.5913287 0.53154012 1.0214099 0.2846549 -0.295014918
## V6 V7 V8 V9 V10 V11
## 1 0.34706734 0.1936786 0.08263713 0.3311272 0.08338540 -0.5404061
## 2 -0.06181986 -0.0637001 0.07125336 -0.2324938 -0.15334936 1.5800001
## 3 1.35145121 0.6397745 0.20737237 -1.3786729 0.19069928 0.6118286
## 4 0.93614819 0.1920703 0.31601704 -1.2625010 -0.05046786 -0.2218912
## 5 0.07199846 0.4793014 -0.22650983 0.7443250 0.69162382 -0.8061452
## V12 V13 V14 V15 V16 V17 V18
## 1 -0.6182946 -0.9960972 -0.3246096 1.6040110 -0.5368319 0.2448630 0.03076988
## 2 1.0660867 0.4914173 -0.1499822 0.6943592 0.5294328 -0.1351697 -0.21876220
## 3 0.0661365 0.7206986 -0.1731136 2.5629017 -3.2982296 1.3068656 -0.14478974
## 4 0.1783707 0.5101678 -0.3003600 -0.6898362 -1.2092939 -0.8054432 2.34530040
## 5 0.5386257 1.3522420 -1.1680315 0.1913231 -0.5152042 -0.2790803 -0.04556892
## V19 V20 V21 V22 V23 V24
## 1 0.4962812 0.32611744 -0.02492332 0.382853766 -0.1769110 0.1105067
## 2 -0.1790857 -0.08961071 -0.30737626 -0.880075209 0.1622009 -0.5611296
## 3 -2.7785560 0.68097378 0.33763110 1.063356404 1.4563172 -1.1380901
## 4 -1.5142023 -0.26985475 -0.14744304 0.007266895 -0.3047760 -1.9410237
## 5 0.9870356 0.52993786 -0.01283920 1.100009340 -0.2201230 0.2332497
## V25 V26 V27 V28 Amount Class
## 1 0.2465850 -0.3921697 0.33089104 -0.06378104 0.24496383 0
## 2 0.3206933 0.2610690 -0.02225564 0.04460744 -0.34247394 0
## 3 -0.6285356 -0.2884462 -0.13713661 -0.18102051 1.16068389 0
## 4 1.2419015 -0.4602165 0.15539593 0.18618826 0.14053401 0
## 5 -0.3952009 1.0416095 0.54361884 0.65181477 -0.07340321 0
Why are we splitting our data? And why, in our next step, are we rebalancing the training data so that the “Class” variable, which represents fraud, is even in the training set, while the testing set is left unaltered? One answer: “Bayesian statistics and machine learning.”
In Fisher (frequentist) statistics, we set up a null hypothesis to be disproved according to a p-value. Fisher's framework assumes lab-like conditions where variables are controlled and manipulated.
In machine learning, by contrast, the goal is to estimate a likelihood from the training data: the model learns the parameters of the target from the examples it is shown. The idea behind ML is that models are built on parameters estimated from real-world data.
Training a typical machine learning model is therefore an iterative process that updates the model's parameters so as to maximize the probability of the observed training data given those parameters.
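As a toy illustration of “maximizing the probability of the training data” (a sketch only; the grid of candidate rates is made up for the example), the Bernoulli log-likelihood of the Class labels peaks at the observed fraud proportion:
# Evaluate the log-likelihood of the 0/1 labels over a grid of candidate fraud rates
y <- as.numeric(as.character(df$Class))    # 0/1 labels
p_grid <- seq(0.0005, 0.005, by = 0.0001)  # candidate fraud rates
loglik <- sapply(p_grid, function(p) sum(dbinom(y, size = 1, prob = p, log = TRUE)))
p_grid[which.max(loglik)]                  # the maximizer of the likelihood ...
mean(y)                                    # ... matches the observed fraud rate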
set.seed(1234)
index <- createDataPartition(df$Class, p = 0.8, list = FALSE)
train <- df[index, ]
test <- df[-index, ]
prop.table(table(train$Class))
##
## 0 1
## 0.998270762 0.001729238
There are a few ways to balance the dataset on the Class variable.

Upsampling - this method increases the size of the minority class by sampling with replacement until the classes have the same size. Its advantage is that no information is lost. Its disadvantage is that, since oversampling simply adds replicated observations to the original data set, it often leads to overfitting.

Downsampling - in contrast to the above, this method decreases the size of the majority class to the same size as (or closer to) the minority class by taking a random sample of it. The drawback is that it decreases the overall size of the training set, which can be a problem for some algorithms.

Hybrid methods - these include ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique), which downsample the majority class and create new artificial points in the minority class.

We will be using downsampling. With a dataset this large it is often preferred over oversampling: it keeps only genuine observations of the cases we are trying to detect and avoids the overfitting risk that comes with replicating minority-class rows. A sketch of the alternative approaches follows for reference.
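For reference only (we do not use these objects later; trainup and trainrose are names introduced just for this sketch), here is how the alternatives would look with the packages already loaded:
# Oversampling with caret and a synthetic balanced sample with ROSE
set.seed(111)
trainup   <- upSample(x = train[, -ncol(train)], y = train$Class)  # replicate minority rows
trainrose <- ROSE(Class ~ ., data = train, seed = 111)$data        # synthetic balanced sample
table(trainup$Class)
table(trainrose$Class)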
ggplot(train,aes(x=Amount,y=Time,group=Class))+
geom_point(aes(color=Class))+
scale_color_brewer(palette="Accent")
set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
y=train$Class)
table(traindown$Class)
##
## 0 1
## 394 394
ggplot(traindown,aes(x=Amount,y=Time,group=Class))+
geom_point(aes(color=Class))+
scale_color_brewer(palette="Accent")
As we can see, in the undersampled training set the fraud and non-fraud points appear far more evenly mixed across Time and Amount.
We are going to look at the correlations of the variables, particularly with fraud, in the balanced (undersampled) training sample. If we used the full data, the correlations would be distorted by the imbalance in the classes.
Evaluating a correlation matrix: in reading a correlation matrix,
below +/- .29 is considered a LOW correlation,
between +/- .30 and .49 is a MEDIUM correlation, and
between +/- .50 and 1.00 is a HIGH correlation.
Pearson's correlation test: we compute a correlation matrix for all quantitative variables. Variables that are categorical must be dropped or converted, so we temporarily convert Class to numeric below. We also keep the correlation matrix itself for a visualization.
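As a small convenience (a sketch; cor_strength is a hypothetical helper, not part of the analysis), the thresholds above can be encoded directly:
# Label the strength of a correlation coefficient using the rule of thumb above
cor_strength <- function(r) {
  cut(abs(r), breaks = c(-Inf, 0.30, 0.50, Inf),
      labels = c("low", "medium", "high"), right = FALSE)
}
cor_strength(c(0.12, -0.45, 0.83))  # "low" "medium" "high"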
str(traindown)
## 'data.frame': 788 obs. of 31 variables:
## $ Time : num -0.22 0.794 0.885 0.548 -1.924 ...
## $ V1 : num -0.745 0.132 0.931 -0.782 -0.213 ...
## $ V2 : num -0.3798 0.5258 -0.3851 -0.0166 0.5473 ...
## $ V3 : num 1.482 -0.414 -0.366 1.463 0.733 ...
## $ V4 : num -0.581 -0.4 0.239 -1.538 -0.115 ...
## $ V5 : num 0.0479 0.7022 -0.4524 -0.6077 0.2111 ...
## $ V6 : num 0.3126 -0.66 -0.2289 0.0536 -0.0185 ...
## $ V7 : num -0.395 0.775 -0.374 -0.144 0.365 ...
## $ V8 : num 0.4149 -0.128 0.0825 0.0784 0.2601 ...
## $ V9 : num 0.82 -0.178 1.039 -0.217 -0.389 ...
## $ V10 : num -1.0365 -0.8692 0.0106 0.1761 -0.3224 ...
## $ V11 : num 1.558 -0.734 0.423 0.464 1.153 ...
## $ V12 : num 0.907 -0.8598 0.5089 0.0321 -0.157 ...
## $ V13 : num -0.667 -0.82 -1.089 -0.264 -1.329 ...
## $ V14 : num 0.1267 -0.7878 0.4574 -1.1714 0.0565 ...
## $ V15 : num 1.7574 0.5239 0.0981 -2.3711 0.6804 ...
## $ V16 : num -1.465 0.565 0.585 1.255 0.512 ...
## $ V17 : num 0.68909 0.39668 -0.97217 0.01392 -0.00474 ...
## $ V18 : num 0.195 0.555 0.571 -1.308 0.225 ...
## $ V19 : num 1.7199 -0.8332 0.4376 0.8007 -0.0349 ...
## $ V20 : num -0.172 -0.1987 -0.168 -0.108 0.0376 ...
## $ V21 : num 0.1404 0.2755 -0.2229 -0.0167 -0.2969 ...
## $ V22 : num 1.028 0.781 -0.814 0.138 -0.879 ...
## $ V23 : num 0.83108 -0.28355 0.46257 -0.52451 -0.00971 ...
## $ V24 : num -0.467 0.893 -0.685 0.134 -0.658 ...
## $ V25 : num 0.524 -0.267 -0.944 0.704 -0.507 ...
## $ V26 : num -0.916 1.201 -1.383 -0.947 0.226 ...
## $ V27 : num 0.419 -0.0863 0.0113 -2.3522 0.6106 ...
## $ V28 : num -0.0589 0.0414 -0.1042 -1.2205 0.239 ...
## $ Amount: num -0.3492 -0.3096 -0.0255 -0.2293 -0.3144 ...
## $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
traindown$Class <- as.numeric(as.character(traindown$Class))
class(traindown$Class)
## [1] "numeric"
cormat <- round(cor(traindown),
digits = 2 # rounded to 2 decimals
)
print(cormat)
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## Time 1.00 0.25 -0.25 0.16 -0.22 0.30 0.10 0.23 -0.15 0.19 0.23 -0.33
## V1 0.25 1.00 -0.80 0.87 -0.62 0.86 0.30 0.87 -0.11 0.66 0.74 -0.52
## V2 -0.25 -0.80 1.00 -0.85 0.67 -0.77 -0.33 -0.84 0.02 -0.71 -0.77 0.62
## V3 0.16 0.87 -0.85 1.00 -0.77 0.84 0.46 0.88 -0.21 0.76 0.85 -0.71
## V4 -0.22 -0.62 0.67 -0.77 1.00 -0.59 -0.45 -0.71 0.14 -0.80 -0.80 0.80
## V5 0.30 0.86 -0.77 0.84 -0.59 1.00 0.30 0.83 -0.24 0.66 0.75 -0.53
## V6 0.10 0.30 -0.33 0.46 -0.45 0.30 1.00 0.32 -0.54 0.40 0.44 -0.52
## V7 0.23 0.87 -0.84 0.88 -0.71 0.83 0.32 1.00 0.05 0.76 0.86 -0.64
## V8 -0.15 -0.11 0.02 -0.21 0.14 -0.24 -0.54 0.05 1.00 -0.10 -0.08 0.22
## V9 0.19 0.66 -0.71 0.76 -0.80 0.66 0.40 0.76 -0.10 1.00 0.86 -0.71
## V10 0.23 0.74 -0.77 0.85 -0.80 0.75 0.44 0.86 -0.08 0.86 1.00 -0.81
## V11 -0.33 -0.52 0.62 -0.71 0.80 -0.53 -0.52 -0.64 0.22 -0.71 -0.81 1.00
## V12 0.28 0.59 -0.67 0.76 -0.84 0.61 0.52 0.72 -0.20 0.77 0.88 -0.91
## V13 -0.11 -0.06 0.06 -0.07 0.04 -0.12 -0.13 -0.02 0.27 -0.04 -0.05 0.08
## V14 0.20 0.44 -0.58 0.66 -0.81 0.44 0.56 0.55 -0.24 0.70 0.77 -0.90
## V15 -0.13 0.10 -0.14 0.12 -0.10 0.08 -0.08 0.15 0.14 0.09 0.13 -0.03
## V16 0.26 0.63 -0.63 0.72 -0.73 0.69 0.45 0.75 -0.22 0.74 0.86 -0.81
## V17 0.27 0.67 -0.64 0.73 -0.71 0.74 0.44 0.76 -0.26 0.76 0.86 -0.78
## V18 0.29 0.67 -0.62 0.69 -0.64 0.74 0.37 0.76 -0.22 0.72 0.80 -0.68
## V19 -0.08 -0.32 0.27 -0.33 0.33 -0.41 -0.25 -0.38 0.23 -0.36 -0.44 0.42
## V20 -0.04 -0.32 0.26 -0.35 0.28 -0.31 -0.05 -0.37 -0.11 -0.37 -0.38 0.19
## V21 -0.07 0.04 -0.01 0.04 0.00 0.05 -0.07 0.10 0.06 0.15 0.09 0.15
## V22 0.15 -0.04 0.02 -0.06 0.10 -0.09 0.04 -0.15 -0.11 -0.22 -0.21 0.02
## V23 0.07 -0.04 0.10 -0.03 0.02 -0.09 0.32 -0.04 -0.36 -0.06 -0.04 -0.05
## V24 -0.03 -0.04 0.01 0.03 -0.11 -0.11 -0.02 -0.04 0.06 0.04 0.02 -0.12
## V25 -0.17 -0.07 0.12 -0.08 -0.03 -0.09 -0.10 0.07 0.23 -0.01 0.03 0.00
## V26 -0.05 0.03 0.03 -0.04 0.15 0.03 -0.09 0.00 0.04 -0.14 -0.06 0.17
## V27 -0.16 0.18 -0.10 0.07 0.01 0.15 -0.20 0.17 0.31 0.06 0.07 0.21
## V28 0.02 0.15 0.01 0.09 -0.04 0.15 -0.08 0.13 0.02 0.11 0.10 0.05
## Amount 0.02 -0.05 -0.27 -0.01 0.02 -0.12 0.21 0.14 0.02 0.02 0.00 -0.02
## Class -0.17 -0.44 0.49 -0.57 0.71 -0.36 -0.44 -0.48 0.09 -0.57 -0.63 0.70
## V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
## Time 0.28 -0.11 0.20 -0.13 0.26 0.27 0.29 -0.08 -0.04 -0.07 0.15 0.07
## V1 0.59 -0.06 0.44 0.10 0.63 0.67 0.67 -0.32 -0.32 0.04 -0.04 -0.04
## V2 -0.67 0.06 -0.58 -0.14 -0.63 -0.64 -0.62 0.27 0.26 -0.01 0.02 0.10
## V3 0.76 -0.07 0.66 0.12 0.72 0.73 0.69 -0.33 -0.35 0.04 -0.06 -0.03
## V4 -0.84 0.04 -0.81 -0.10 -0.73 -0.71 -0.64 0.33 0.28 0.00 0.10 0.02
## V5 0.61 -0.12 0.44 0.08 0.69 0.74 0.74 -0.41 -0.31 0.05 -0.09 -0.09
## V6 0.52 -0.13 0.56 -0.08 0.45 0.44 0.37 -0.25 -0.05 -0.07 0.04 0.32
## V7 0.72 -0.02 0.55 0.15 0.75 0.76 0.76 -0.38 -0.37 0.10 -0.15 -0.04
## V8 -0.20 0.27 -0.24 0.14 -0.22 -0.26 -0.22 0.23 -0.11 0.06 -0.11 -0.36
## V9 0.77 -0.04 0.70 0.09 0.74 0.76 0.72 -0.36 -0.37 0.15 -0.22 -0.06
## V10 0.88 -0.05 0.77 0.13 0.86 0.86 0.80 -0.44 -0.38 0.09 -0.21 -0.04
## V11 -0.91 0.08 -0.90 -0.03 -0.81 -0.78 -0.68 0.42 0.19 0.15 0.02 -0.05
## V12 1.00 -0.09 0.89 0.04 0.90 0.88 0.80 -0.48 -0.21 -0.10 -0.09 0.02
## V13 -0.09 1.00 -0.02 -0.02 -0.11 -0.14 -0.14 0.19 -0.07 0.06 -0.01 -0.10
## V14 0.89 -0.02 1.00 -0.03 0.78 0.74 0.63 -0.38 -0.13 -0.23 0.07 0.02
## V15 0.04 -0.02 -0.03 1.00 -0.02 0.00 0.00 0.22 -0.14 0.16 -0.11 -0.06
## V16 0.90 -0.11 0.78 -0.02 1.00 0.95 0.91 -0.64 -0.19 -0.15 -0.11 0.03
## V17 0.88 -0.14 0.74 0.00 0.95 1.00 0.94 -0.61 -0.21 -0.11 -0.12 0.03
## V18 0.80 -0.14 0.63 0.00 0.91 0.94 1.00 -0.57 -0.18 -0.08 -0.12 0.04
## V19 -0.48 0.19 -0.38 0.22 -0.64 -0.61 -0.57 1.00 0.04 0.12 0.13 -0.03
## V20 -0.21 -0.07 -0.13 -0.14 -0.19 -0.21 -0.18 0.04 1.00 -0.49 0.38 0.21
## V21 -0.10 0.06 -0.23 0.16 -0.15 -0.11 -0.08 0.12 -0.49 1.00 -0.70 0.05
## V22 -0.09 -0.01 0.07 -0.11 -0.11 -0.12 -0.12 0.13 0.38 -0.70 1.00 0.08
## V23 0.02 -0.10 0.02 -0.06 0.03 0.03 0.04 -0.03 0.21 0.05 0.08 1.00
## V24 0.05 0.09 0.14 0.00 -0.04 -0.07 -0.11 0.09 -0.07 -0.03 0.05 -0.04
## V25 0.04 0.04 -0.05 -0.03 0.08 0.04 0.07 -0.17 0.04 0.13 -0.22 0.16
## V26 -0.15 0.03 -0.21 0.06 -0.09 -0.07 -0.04 0.07 0.02 0.03 0.03 0.02
## V27 -0.09 0.04 -0.23 0.15 -0.08 -0.06 -0.01 0.07 -0.08 0.37 -0.34 -0.20
## V28 -0.04 -0.13 -0.15 0.12 -0.01 0.03 0.08 -0.03 0.05 0.27 -0.23 0.03
## Amount 0.02 -0.02 0.05 0.04 0.00 -0.02 0.00 -0.03 0.27 0.05 -0.04 -0.06
## Class -0.69 -0.06 -0.76 0.00 -0.59 -0.55 -0.46 0.27 0.16 0.12 -0.01 -0.03
## V24 V25 V26 V27 V28 Amount Class
## Time -0.03 -0.17 -0.05 -0.16 0.02 0.02 -0.17
## V1 -0.04 -0.07 0.03 0.18 0.15 -0.05 -0.44
## V2 0.01 0.12 0.03 -0.10 0.01 -0.27 0.49
## V3 0.03 -0.08 -0.04 0.07 0.09 -0.01 -0.57
## V4 -0.11 -0.03 0.15 0.01 -0.04 0.02 0.71
## V5 -0.11 -0.09 0.03 0.15 0.15 -0.12 -0.36
## V6 -0.02 -0.10 -0.09 -0.20 -0.08 0.21 -0.44
## V7 -0.04 0.07 0.00 0.17 0.13 0.14 -0.48
## V8 0.06 0.23 0.04 0.31 0.02 0.02 0.09
## V9 0.04 -0.01 -0.14 0.06 0.11 0.02 -0.57
## V10 0.02 0.03 -0.06 0.07 0.10 0.00 -0.63
## V11 -0.12 0.00 0.17 0.21 0.05 -0.02 0.70
## V12 0.05 0.04 -0.15 -0.09 -0.04 0.02 -0.69
## V13 0.09 0.04 0.03 0.04 -0.13 -0.02 -0.06
## V14 0.14 -0.05 -0.21 -0.23 -0.15 0.05 -0.76
## V15 0.00 -0.03 0.06 0.15 0.12 0.04 0.00
## V16 -0.04 0.08 -0.09 -0.08 -0.01 0.00 -0.59
## V17 -0.07 0.04 -0.07 -0.06 0.03 -0.02 -0.55
## V18 -0.11 0.07 -0.04 -0.01 0.08 0.00 -0.46
## V19 0.09 -0.17 0.07 0.07 -0.03 -0.03 0.27
## V20 -0.07 0.04 0.02 -0.08 0.05 0.27 0.16
## V21 -0.03 0.13 0.03 0.37 0.27 0.05 0.12
## V22 0.05 -0.22 0.03 -0.34 -0.23 -0.04 -0.01
## V23 -0.04 0.16 0.02 -0.20 0.03 -0.06 -0.03
## V24 1.00 -0.05 -0.12 -0.15 -0.07 -0.05 -0.13
## V25 -0.05 1.00 0.06 0.17 0.15 -0.08 0.02
## V26 -0.12 0.06 1.00 0.17 0.03 -0.06 0.06
## V27 -0.15 0.17 0.17 1.00 0.24 0.03 0.12
## V28 -0.07 0.15 0.03 0.24 1.00 -0.06 0.09
## Amount -0.05 -0.08 -0.06 0.03 -0.06 1.00 0.04
## Class -0.13 0.02 0.06 0.12 0.09 0.04 1.00
M <-cor(traindown)
options(repr.plot.width=35, repr.plot.height=35)
corrplot(M, type="upper", order = "hclust", col=brewer.pal(n=8, name="PiYG"))
V17, V14, V12 and V10 are negatively correlated with Class: the lower these values, the more likely the transaction is fraudulent.
V2, V4, V11, and V19 are positively correlated with Class: the higher these values, the more likely the transaction is fraudulent.
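A quick way to see this ranking directly (a sketch using the cormat object computed above):
# Sort every variable by its correlation with Class in the balanced sample
class_cor <- sort(cormat[, "Class"])
head(class_cor, 5)  # most negative correlations (V14, V12, V10, ...)
tail(class_cor, 5)  # most positive correlations (Class itself is last, at 1.00)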
We leave Class as numeric for the clustering step.
km.res <- kmeans(traindown, 3, nstart = 10)
# drop the Class column (the last column) from the data used for the cluster plot
fviz_cluster(km.res, traindown[, -ncol(traindown)], ellipse.type = "norm")
Although the subsample is fairly small, the k-means algorithm is able to pick out reasonably distinct clusters.
This gives us an indication that the predictive models to follow should perform well at separating fraud cases from non-fraud cases.
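One informal way to check that claim (a quick sketch using the objects already in memory) is to cross-tabulate the k-means assignments against the known labels:
# How the fraud / non-fraud labels distribute across the three clusters
table(cluster = km.res$cluster, class = traindown$Class)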
We change the Class variable back to a factor because the algorithms we wish to use (logistic regression, decision trees) require the target variable to be a factor.
traindown$Class <- as.factor(traindown$Class)
In many modeling settings, such as psychological studies, we would want to detect outliers and remove those cases. However, in several verticals, such as sensor fault detection, fraud detection, and disaster risk warning systems, it is the outliers or anomalies that are of most interest, as they often indicate the unusual situation we are trying to detect. Again, this also informs the choice of algorithm. Nevertheless, we will run an outlier detection routine to view the outliers in our dataset.
# create an outlier-detection function (1.5 * IQR rule)
detect_outlier <- function(x) {
  # first quartile
  Quantile1 <- quantile(x, probs = .25)
  # third quartile
  Quantile3 <- quantile(x, probs = .75)
  # interquartile range
  IQR <- Quantile3 - Quantile1
  # TRUE for values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}
remove_outlier <- function(data, columns = names(data)) {
  # loop over the requested columns
  for (col in columns) {
    # drop rows flagged as outliers in this column
    data <- data[!detect_outlier(data[[col]]), ]
  }
  # preview the result (we only inspect it here; traindown itself is left unchanged)
  print("Remove outliers")
  head(data, 5)
}
remove_outlier(traindown,c('V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24','V25','V26','V27','V28','Time','Amount'))
## [1] "Remove outliers"
## Time V1 V2 V3 V4 V5
## 2 0.7938642 0.1317100 0.5258382 -0.4143913 -0.3997963 0.70217527
## 3 0.8847711 0.9309564 -0.3851299 -0.3655774 0.2385870 -0.45239911
## 5 -1.9241193 -0.2128161 0.5472595 0.7330370 -0.1148105 0.21113849
## 7 -0.5321088 -0.4544119 0.1833689 1.5995685 0.1161120 0.04634072
## 13 0.8342533 0.8386033 0.1204831 -0.5899095 2.6304709 0.45016039
## V6 V7 V8 V9 V10 V11
## 2 -0.65996511 0.77468106 -0.128018995 -0.1784619 -0.86920371 -0.7342596
## 3 -0.22888926 -0.37423865 0.082488530 1.0390051 0.01055984 0.4231858
## 5 -0.01847267 0.36481386 0.260136357 -0.3889529 -0.32241282 1.1531301
## 7 0.10868678 -0.06680613 0.326499077 0.1426745 -0.62806958 0.2808093
## 13 0.31974095 0.32633809 -0.001117488 -1.1987267 1.38071689 0.5551154
## V12 V13 V14 V15 V16 V17
## 2 -0.8597611 -0.8200004 -0.78783645 0.5238678 0.5647907 0.396683247
## 3 0.5088843 -1.0893732 0.45737714 0.0980693 0.5853090 -0.972171149
## 5 -0.1570425 -1.3288297 0.05654065 0.6804164 0.5123328 -0.004735056
## 7 0.5250424 -0.5926783 -0.21898132 -1.1436442 0.5856412 -0.952697112
## 13 0.8643520 0.4576969 0.35062848 -2.0314462 0.9367574 -1.073576837
## V18 V19 V20 V21 V22 V23
## 2 0.5554989 -0.83316850 -0.19866615 0.2754933 0.7810111 -0.283545336
## 3 0.5714500 0.43755314 -0.16801094 -0.2228802 -0.8136636 0.462568554
## 5 0.2249984 -0.03490229 0.03757060 -0.2969483 -0.8785076 -0.009712052
## 7 0.8498477 -0.32515944 -0.09061588 0.1009944 0.2667148 -0.296062393
## 13 -0.4457044 -1.26103240 0.04252027 -0.1381251 -0.8001834 0.333833597
## V24 V25 V26 V27 V28 Amount Class
## 2 0.89267505 -0.26688768 1.2014236 -0.08625353 0.04143969 -0.30964971 0
## 3 -0.68535469 -0.94366563 -1.3831463 0.01129131 -0.10418028 -0.02546624 0
## 5 -0.65790428 -0.50681543 0.2259221 0.61058710 0.23896781 -0.31436744 0
## 7 -0.01999072 0.06949997 -1.1400439 0.30674777 0.35328760 -0.31328796 0
## 13 1.18596016 -0.37551862 -0.6966841 -0.16897848 -0.07705673 0.18399313 0
We are going to fit a logistic regression model. How do we know which algorithms to try?
When selecting an algorithm or family of algorithms, you consider a few things:
The question you are answering - in this case, whether a transaction is fraud or not fraud. This is a classification problem; classification problems deal with discrete outcomes (yes/no binary or multi-class).
The type of data you have - in our case all of the predictor variables are continuous (not mixed) and there are no missing values. Often you will have missing data, and you may choose to delete or impute those values; in fraud work, missing values can themselves be a signal, but that is a discussion for another time. Some algorithms handle missing data better than others, but since we have none, we do not have to consider this.
Logistic regression is conceptually similar to linear regression, but where linear regression estimates the value of the target variable directly, logistic regression estimates the odds of a certain event occurring.
Since we are predicting fraud, with logistic regression we are actually estimating the odds of fraud occurring.
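To make the odds idea concrete, a minimal sketch with base R (the log_odds values are made up for illustration):
# The logistic link: a linear predictor (log-odds) maps to a probability via the inverse logit
log_odds <- c(-2, 0, 2)       # example linear-predictor values
probs    <- plogis(log_odds)  # 1 / (1 + exp(-x)) -> 0.119, 0.500, 0.881
probs / (1 - probs)           # the corresponding odds -> 0.135, 1.000, 7.389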
Another model we will try is the SVM, or Support Vector Machine. An SVM is a supervised machine learning algorithm used for complex classification problems. Its objective is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly separates the classes of data points. Many possible hyperplanes could separate the two classes; the objective is to find the plane with the maximum margin, i.e., the maximum distance between it and the nearest data points of each class. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
LR_Model <- glm(formula = Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data = traindown, family = "binomial")
Valid_trainData <- predict(LR_Model, newdata = traindown, type = "response") # predicted probabilities on the training data
Valid_trainData <- ifelse(Valid_trainData > 0.5, 1, 0) # apply a 0.5 prediction threshold
confusion_Mat <- confusionMatrix(as.factor(traindown$Class), as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 388 6
## 1 30 364
##
## Accuracy : 0.9543
## 95% CI : (0.9373, 0.9678)
## No Information Rate : 0.5305
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9086
##
## Mcnemar's Test P-Value : 0.0001264
##
## Sensitivity : 0.9282
## Specificity : 0.9838
## Pos Pred Value : 0.9848
## Neg Pred Value : 0.9239
## Prevalence : 0.5305
## Detection Rate : 0.4924
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9560
##
## 'Positive' Class : 0
##
Overall, this is a very high-performing model. Looking at the confusion matrix, the main room for improvement is the small number of fraud cases that are misclassified as non-fraud.
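Note that confusionMatrix() reports 'Positive' Class : 0 here, so Sensitivity refers to the non-fraud class. A sketch of the same call with positive = "1", so that Sensitivity and Pos Pred Value describe the fraud class directly:
# Same confusion matrix, but treating fraud (Class = 1) as the positive class
confusionMatrix(as.factor(traindown$Class), as.factor(Valid_trainData), positive = "1")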
testData_Class_predicted <- predict(LR_Model, newdata = test, type = "response")
testData_Class_predicted <- ifelse(testData_Class_predicted > 0.5, 1, 0) # set binary prediction threshold
conMat<- confusionMatrix(as.factor(test$Class),as.factor(testData_Class_predicted))
Regression_Acc_Test <-round(conMat$overall["Accuracy"]*100,2)
paste('Model Test Accuracy =', Regression_Acc_Test)
## [1] "Model Test Accuracy = 96.77"
roc_object <- roc( test$Class, testData_Class_predicted)
auc(roc_object)
## Area under the curve: 0.9278
The ROC curve helps us visualize the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate) across threshold values. A perfect classifier's ROC curve passes through the top-left corner of the plot (100% sensitivity and 100% specificity). The area under the ROC curve (AUC) is a scalar summary of the classifier's performance: an AUC of 1.0 indicates a perfect classifier, while an AUC of 0.5 means the classifier is no better than random chance. Ours is approximately 0.93.
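We can also draw the curve itself; a minimal sketch with pROC's plot method (the color and title are arbitrary choices):
# Plot the ROC curve for the logistic model's test-set predictions
plot(roc_object, col = "purple", main = "Logistic regression ROC (test set)")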
The other piece of good news is that the accuracies on the training data and on the test data are very close, which makes over-fitting unlikely.
SVM_model <- svm(Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data = traindown, kernel = 'radial', type = "C-classification")
Valid_trainData <- predict(SVM_model, traindown) # predictions on the training data
confusion_Mat <- confusionMatrix(as.factor(traindown$Class), as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 393 1
## 1 40 354
##
## Accuracy : 0.948
## 95% CI : (0.9301, 0.9624)
## No Information Rate : 0.5495
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8959
##
## Mcnemar's Test P-Value : 2.946e-09
##
## Sensitivity : 0.9076
## Specificity : 0.9972
## Pos Pred Value : 0.9975
## Neg Pred Value : 0.8985
## Prevalence : 0.5495
## Detection Rate : 0.4987
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : 0
##
SVM_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100, 4)
paste('Model Train Accuracy =', SVM_Acc_Train)
## [1] "Model Train Accuracy = 94.797"
Test_Fraud_predicted <- predict(SVM_model, test) # predictions on the test set
conMat <- confusionMatrix(as.factor(test$Class), as.factor(Test_Fraud_predicted)) # produce confusion matrix
SVM_Acc_Test <- round(conMat$overall["Accuracy"]*100, 4)
paste('Model Test Accuracy =', SVM_Acc_Test)
## [1] "Model Test Accuracy = 98.6148"
SVM_Acc <- c(SVM_Acc_Train, SVM_Acc_Test)
SVM_Acc
## Accuracy Accuracy
## 94.7970 98.6148