This is a demo in R of predicting credit card fraud. We will work through the roadmap outlined below, which includes several steps. Of note, we are using a publicly available, anonymized dataset, “creditcard.csv.”

The dataset has already gone through PCA (Principal Component Analysis); we will discuss how that impacts the analysis. The target (dependent) variable is labeled “Class” and flags fraud; it is already coded in the set as ‘0’ (legitimate) or ‘1’ (fraud). We will also discuss how this would be deployed in an enterprise environment.

Calling the Libraries Needed

library(dplyr)        # for data manipulation
library(stringr)      # for string manipulation
library(caret)        # for sampling, data partitioning, and confusion matrices
library(caTools)      # for train/test split
library(ggplot2)      # for data visualization
library(lattice)      # additional plotting (used by caret)
library(corrplot)     # for correlation plots
library(Rtsne)        # for t-SNE plotting
library(rpart)        # for decision tree model
library(Rborist)      # for random forest model
library(xgboost)      # for xgboost model
library(C50)          # for C5.0 decision trees
library(ROSE)         # for sampling methods for imbalanced data
library(repr)         # for plot sizing options
library(RColorBrewer) # for color palettes
library(e1071)        # for SVM
library(factoextra)   # for clustering visualization
library(ROCR)         # for ROC curves
library(pROC)         # for ROC/AUC
library(nnet)         # for neural networks
library(tidyr)        # for data tidying

Bringing in the data

df <- read.csv("/cloud/project/ccard.csv", header=TRUE, stringsAsFactors=FALSE)
head(df,5)
##   Time         V1          V2        V3         V4          V5          V6
## 1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5    2 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
##          V13        V14        V15        V16        V17         V18        V19
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.2079712  0.02579058  0.4039930
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.1148047 -0.18336127 -0.1457830
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.1099694 -0.12135931 -2.2618571
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.6840928  1.96577500 -1.2326220
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.2370332 -0.03819479  0.8034869
##           V20          V21          V22        V23         V24        V25
## 1  0.25141210 -0.018306778  0.277837576 -0.1104739  0.06692807  0.1285394
## 2 -0.06908314 -0.225775248 -0.638671953  0.1012880 -0.33984648  0.1671704
## 3  0.52497973  0.247998153  0.771679402  0.9094123 -0.68928096 -0.3276418
## 4 -0.20803778 -0.108300452  0.005273597 -0.1903205 -1.17557533  0.6473760
## 5  0.40854236 -0.009430697  0.798278495 -0.1374581  0.14126698 -0.2060096
##          V26          V27         V28 Amount Class
## 1 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2  0.1258945 -0.008983099  0.01472417   2.69     0
## 3 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4 -0.2219288  0.062722849  0.06145763 123.50     0
## 5  0.5022922  0.219422230  0.21515315  69.99     0

1) EDAs - Exploratory Data Analysis

Because the variables are PCA components, they are labeled V1, V2, V3, and so on, and they are expressed on the same scale with means of 0. Since principal components are constructed to be orthogonal, when we run our correlation on the full dataset they will be largely uncorrelated with one another. The only variables not transformed are Amount and Time, which we will scale later.

str(df)
## 'data.frame':    284807 obs. of  31 variables:
##  $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...

Data Summary/EDAs

summary(df)
##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139320   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount             Class         
##  Min.   :-15.43008   Min.   :    0.00   Min.   :0.000000  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1st Qu.:0.000000  
##  Median :  0.01124   Median :   22.00   Median :0.000000  
##  Mean   :  0.00000   Mean   :   88.35   Mean   :0.001728  
##  3rd Qu.:  0.07828   3rd Qu.:   77.17   3rd Qu.:0.000000  
##  Max.   : 33.84781   Max.   :25691.16   Max.   :1.000000

Looking for any missing

sum(is.na(df))
## [1] 0

We knew, because our data has already been cleaned and transformed, that there would be no missing values; however, the code above would have flagged any missing values if the data had not already been cleaned.

When we run our summary, we also see that the Amount variable has at least one extreme outlier (a maximum of 25,691.16 against a mean of 88.35), which could impact our analysis.

Frequency table for class

df$Class<-as.factor(df$Class)
prop.table(table(df$Class))
## 
##           0           1 
## 0.998272514 0.001727486

Barplot of Class

counts <- table(df$Class)
barplot(counts, main="Fraud Distribution",
        xlab="Counts of Fraud", col="purple")

From this we can see the very uneven representation of credit card fraud in the data, which is typical for fraud problems. It means that predictions will be inclined toward the majority class.

We will have to address this. With such an imbalance, a model can post a high accuracy number, but we should not be fooled by that number, because the sensitivity (the ability to catch the fraud cases) will be low. Therefore, we have to address this imbalance in the dataset before going further.

Distribution of Amount

hist(df$Amount, col = 'purple', border = "white")

Distribution of Time

hist(df$Time, col = 'purple', border = "white")

Histograms of other continuous vars

a <- colnames(df)[2:29]   # the PCA components V1-V28
par(mfrow = c(1, 4))      # arrange four histograms per row
for (i in 1:length(a)){
  sub <- df[a[i]][, 1]    # extract the column as a numeric vector
  hist(sub, main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
}

2) Preprocessing/Cleaning the Data

Now we are going to scale the variables Time and Amount so that they match the other variables. We are using the z-score formula, which centers each variable at a mean of 0 and standardizes it to unit variance.

Scaling Time and Amount

Applying scaling function

df[1 : 30] <- as.data.frame(scale(df[1 : 30])) 

Displaying result

head(df,5)
##        Time         V1          V2        V3         V4           V5
## 1 -1.996580 -0.6942411 -0.04407485 1.6727706  0.9733638 -0.245116153
## 2 -1.996580  0.6084953  0.16117564 0.1097969  0.3165224  0.043483276
## 3 -1.996558 -0.6934992 -0.81157640 1.1694664  0.2682308 -0.364571146
## 4 -1.996558 -0.4933240 -0.11216923 1.1825144 -0.6097256 -0.007468867
## 5 -1.996537 -0.5913287  0.53154012 1.0214099  0.2846549 -0.295014918
##            V6         V7          V8         V9         V10        V11
## 1  0.34706734  0.1936786  0.08263713  0.3311272  0.08338540 -0.5404061
## 2 -0.06181986 -0.0637001  0.07125336 -0.2324938 -0.15334936  1.5800001
## 3  1.35145121  0.6397745  0.20737237 -1.3786729  0.19069928  0.6118286
## 4  0.93614819  0.1920703  0.31601704 -1.2625010 -0.05046786 -0.2218912
## 5  0.07199846  0.4793014 -0.22650983  0.7443250  0.69162382 -0.8061452
##          V12        V13        V14        V15        V16        V17         V18
## 1 -0.6182946 -0.9960972 -0.3246096  1.6040110 -0.5368319  0.2448630  0.03076988
## 2  1.0660867  0.4914173 -0.1499822  0.6943592  0.5294328 -0.1351697 -0.21876220
## 3  0.0661365  0.7206986 -0.1731136  2.5629017 -3.2982296  1.3068656 -0.14478974
## 4  0.1783707  0.5101678 -0.3003600 -0.6898362 -1.2092939 -0.8054432  2.34530040
## 5  0.5386257  1.3522420 -1.1680315  0.1913231 -0.5152042 -0.2790803 -0.04556892
##          V19         V20         V21          V22        V23        V24
## 1  0.4962812  0.32611744 -0.02492332  0.382853766 -0.1769110  0.1105067
## 2 -0.1790857 -0.08961071 -0.30737626 -0.880075209  0.1622009 -0.5611296
## 3 -2.7785560  0.68097378  0.33763110  1.063356404  1.4563172 -1.1380901
## 4 -1.5142023 -0.26985475 -0.14744304  0.007266895 -0.3047760 -1.9410237
## 5  0.9870356  0.52993786 -0.01283920  1.100009340 -0.2201230  0.2332497
##          V25        V26         V27         V28      Amount Class
## 1  0.2465850 -0.3921697  0.33089104 -0.06378104  0.24496383     0
## 2  0.3206933  0.2610690 -0.02225564  0.04460744 -0.34247394     0
## 3 -0.6285356 -0.2884462 -0.13713661 -0.18102051  1.16068389     0
## 4  1.2419015 -0.4602165  0.15539593  0.18618826  0.14053401     0
## 5 -0.3952009  1.0416095  0.54361884  0.65181477 -0.07340321     0

3) Splitting Data into Train/Test Samples

Why are we splitting our data? And why, in our next step, are we balancing the “Class” (fraud) variable in the training set while leaving the testing set unaltered? One answer: “Bayesian statistics and machine learning.”

In Fisher (frequentist) statistics, we set up a null hypothesis to be disproved according to a p-value. Fisher's framework is based on lab conditions where variables are controlled and manipulated.

In Bayesian machine learning, by contrast, the goal is to estimate the likelihood of the outcome from the training data: the model learns the parameters of the target from the examples it sees. The idea behind ML is that models are built on parameters estimated from real-world data.

Training a typical machine learning model is therefore an iterative process that updates the model's parameters to maximize the probability of seeing the training data given those parameters.

Splitting Into Training/Test Data

set.seed(1234)
index<-createDataPartition(df$Class,p=0.8,list=FALSE)
train<-df[index,]
test<-df [-index,]

Get the proportion of each class in the training set

prop.table(table(train$Class))
## 
##           0           1 
## 0.998270762 0.001729238

4) Sampling/Under-Sampling/SMOTE

There are a few ways to balance the dataset on the Class variable.
Upsampling - this method increases the size of the minority class by sampling with replacement so that the classes end up the same size. The advantage of this method is that it leads to no information loss. The disadvantage is that, since oversampling simply replicates observations from the original data set, it adds many duplicate rows and therefore often leads to overfitting.

Downsampling - in contrast to the above method, this one decreases the size of the majority class to be the same as, or closer to, the minority class size by taking a random sample of it. The problem with this method is that it decreases the size of the overall training set, which can be a problem for some algorithms.

Hybrid methods - these include ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique), which down-sample the majority class and create new artificial points in the minority class.

We will be using downsampling. Generally, downsampling is recommended over oversampling here, as it gives a more accurate representation of the cases within the dataset you are trying to detect, at the cost of a smaller training set. A sketch of the alternative approaches follows below.

Distribution of Class before Undersampling

ggplot(train,aes(x=Amount,y=Time,group=Class))+
  geom_point(aes(color=Class))+
  scale_color_brewer(palette="Accent")

Now we are going to under-sample to balance our training set

Downsampling to balance our training set

set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
                      y=train$Class)

table(traindown$Class)
## 
##   0   1 
## 394 394

Now we will check the distribution again

Distribution of Class after undersampling

ggplot(traindown,aes(x=Amount,y=Time,group=Class))+
  geom_point(aes(color=Class))+
  scale_color_brewer(palette="Accent")

As we can see, after undersampling the fraud and non-fraud cases appear much more evenly mixed across Amount and Time.

5) Correlation Matrices

We are going to look at the correlation of the variables, particularly with respect to fraud, in the balanced (under-sampled) training sample. Otherwise the correlations would be distorted by the imbalance in the classes.

Evaluating a Correlation Matrix

In reading a correlation matrix, a coefficient below +/- .29 is considered a LOW correlation,

between +/- .30 and .49 is a MEDIUM correlation, and

between +/- .50 and 1.00 is a HIGH correlation.

Pearson's Correlation Test

We build the correlation matrix for all quantitative variables, dropping any variables that are not quantitative (categorical). We are also saving the correlation matrix itself as a dataframe for a visualization.

str(traindown)
## 'data.frame':    788 obs. of  31 variables:
##  $ Time  : num  -0.22 0.794 0.885 0.548 -1.924 ...
##  $ V1    : num  -0.745 0.132 0.931 -0.782 -0.213 ...
##  $ V2    : num  -0.3798 0.5258 -0.3851 -0.0166 0.5473 ...
##  $ V3    : num  1.482 -0.414 -0.366 1.463 0.733 ...
##  $ V4    : num  -0.581 -0.4 0.239 -1.538 -0.115 ...
##  $ V5    : num  0.0479 0.7022 -0.4524 -0.6077 0.2111 ...
##  $ V6    : num  0.3126 -0.66 -0.2289 0.0536 -0.0185 ...
##  $ V7    : num  -0.395 0.775 -0.374 -0.144 0.365 ...
##  $ V8    : num  0.4149 -0.128 0.0825 0.0784 0.2601 ...
##  $ V9    : num  0.82 -0.178 1.039 -0.217 -0.389 ...
##  $ V10   : num  -1.0365 -0.8692 0.0106 0.1761 -0.3224 ...
##  $ V11   : num  1.558 -0.734 0.423 0.464 1.153 ...
##  $ V12   : num  0.907 -0.8598 0.5089 0.0321 -0.157 ...
##  $ V13   : num  -0.667 -0.82 -1.089 -0.264 -1.329 ...
##  $ V14   : num  0.1267 -0.7878 0.4574 -1.1714 0.0565 ...
##  $ V15   : num  1.7574 0.5239 0.0981 -2.3711 0.6804 ...
##  $ V16   : num  -1.465 0.565 0.585 1.255 0.512 ...
##  $ V17   : num  0.68909 0.39668 -0.97217 0.01392 -0.00474 ...
##  $ V18   : num  0.195 0.555 0.571 -1.308 0.225 ...
##  $ V19   : num  1.7199 -0.8332 0.4376 0.8007 -0.0349 ...
##  $ V20   : num  -0.172 -0.1987 -0.168 -0.108 0.0376 ...
##  $ V21   : num  0.1404 0.2755 -0.2229 -0.0167 -0.2969 ...
##  $ V22   : num  1.028 0.781 -0.814 0.138 -0.879 ...
##  $ V23   : num  0.83108 -0.28355 0.46257 -0.52451 -0.00971 ...
##  $ V24   : num  -0.467 0.893 -0.685 0.134 -0.658 ...
##  $ V25   : num  0.524 -0.267 -0.944 0.704 -0.507 ...
##  $ V26   : num  -0.916 1.201 -1.383 -0.947 0.226 ...
##  $ V27   : num  0.419 -0.0863 0.0113 -2.3522 0.6106 ...
##  $ V28   : num  -0.0589 0.0414 -0.1042 -1.2205 0.239 ...
##  $ Amount: num  -0.3492 -0.3096 -0.0255 -0.2293 -0.3144 ...
##  $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

To run the correlation we have to change the factor Class to numeric

Convert column ‘Class’ from factor to numeric

traindown$Class <- as.numeric(as.character(traindown$Class))

Checking that it went well

class(traindown$Class)
## [1] "numeric"
cormat <- round(cor(traindown),
                digits = 2 # rounded to 2 decimals
)

print(cormat)
##         Time    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11
## Time    1.00  0.25 -0.25  0.16 -0.22  0.30  0.10  0.23 -0.15  0.19  0.23 -0.33
## V1      0.25  1.00 -0.80  0.87 -0.62  0.86  0.30  0.87 -0.11  0.66  0.74 -0.52
## V2     -0.25 -0.80  1.00 -0.85  0.67 -0.77 -0.33 -0.84  0.02 -0.71 -0.77  0.62
## V3      0.16  0.87 -0.85  1.00 -0.77  0.84  0.46  0.88 -0.21  0.76  0.85 -0.71
## V4     -0.22 -0.62  0.67 -0.77  1.00 -0.59 -0.45 -0.71  0.14 -0.80 -0.80  0.80
## V5      0.30  0.86 -0.77  0.84 -0.59  1.00  0.30  0.83 -0.24  0.66  0.75 -0.53
## V6      0.10  0.30 -0.33  0.46 -0.45  0.30  1.00  0.32 -0.54  0.40  0.44 -0.52
## V7      0.23  0.87 -0.84  0.88 -0.71  0.83  0.32  1.00  0.05  0.76  0.86 -0.64
## V8     -0.15 -0.11  0.02 -0.21  0.14 -0.24 -0.54  0.05  1.00 -0.10 -0.08  0.22
## V9      0.19  0.66 -0.71  0.76 -0.80  0.66  0.40  0.76 -0.10  1.00  0.86 -0.71
## V10     0.23  0.74 -0.77  0.85 -0.80  0.75  0.44  0.86 -0.08  0.86  1.00 -0.81
## V11    -0.33 -0.52  0.62 -0.71  0.80 -0.53 -0.52 -0.64  0.22 -0.71 -0.81  1.00
## V12     0.28  0.59 -0.67  0.76 -0.84  0.61  0.52  0.72 -0.20  0.77  0.88 -0.91
## V13    -0.11 -0.06  0.06 -0.07  0.04 -0.12 -0.13 -0.02  0.27 -0.04 -0.05  0.08
## V14     0.20  0.44 -0.58  0.66 -0.81  0.44  0.56  0.55 -0.24  0.70  0.77 -0.90
## V15    -0.13  0.10 -0.14  0.12 -0.10  0.08 -0.08  0.15  0.14  0.09  0.13 -0.03
## V16     0.26  0.63 -0.63  0.72 -0.73  0.69  0.45  0.75 -0.22  0.74  0.86 -0.81
## V17     0.27  0.67 -0.64  0.73 -0.71  0.74  0.44  0.76 -0.26  0.76  0.86 -0.78
## V18     0.29  0.67 -0.62  0.69 -0.64  0.74  0.37  0.76 -0.22  0.72  0.80 -0.68
## V19    -0.08 -0.32  0.27 -0.33  0.33 -0.41 -0.25 -0.38  0.23 -0.36 -0.44  0.42
## V20    -0.04 -0.32  0.26 -0.35  0.28 -0.31 -0.05 -0.37 -0.11 -0.37 -0.38  0.19
## V21    -0.07  0.04 -0.01  0.04  0.00  0.05 -0.07  0.10  0.06  0.15  0.09  0.15
## V22     0.15 -0.04  0.02 -0.06  0.10 -0.09  0.04 -0.15 -0.11 -0.22 -0.21  0.02
## V23     0.07 -0.04  0.10 -0.03  0.02 -0.09  0.32 -0.04 -0.36 -0.06 -0.04 -0.05
## V24    -0.03 -0.04  0.01  0.03 -0.11 -0.11 -0.02 -0.04  0.06  0.04  0.02 -0.12
## V25    -0.17 -0.07  0.12 -0.08 -0.03 -0.09 -0.10  0.07  0.23 -0.01  0.03  0.00
## V26    -0.05  0.03  0.03 -0.04  0.15  0.03 -0.09  0.00  0.04 -0.14 -0.06  0.17
## V27    -0.16  0.18 -0.10  0.07  0.01  0.15 -0.20  0.17  0.31  0.06  0.07  0.21
## V28     0.02  0.15  0.01  0.09 -0.04  0.15 -0.08  0.13  0.02  0.11  0.10  0.05
## Amount  0.02 -0.05 -0.27 -0.01  0.02 -0.12  0.21  0.14  0.02  0.02  0.00 -0.02
## Class  -0.17 -0.44  0.49 -0.57  0.71 -0.36 -0.44 -0.48  0.09 -0.57 -0.63  0.70
##          V12   V13   V14   V15   V16   V17   V18   V19   V20   V21   V22   V23
## Time    0.28 -0.11  0.20 -0.13  0.26  0.27  0.29 -0.08 -0.04 -0.07  0.15  0.07
## V1      0.59 -0.06  0.44  0.10  0.63  0.67  0.67 -0.32 -0.32  0.04 -0.04 -0.04
## V2     -0.67  0.06 -0.58 -0.14 -0.63 -0.64 -0.62  0.27  0.26 -0.01  0.02  0.10
## V3      0.76 -0.07  0.66  0.12  0.72  0.73  0.69 -0.33 -0.35  0.04 -0.06 -0.03
## V4     -0.84  0.04 -0.81 -0.10 -0.73 -0.71 -0.64  0.33  0.28  0.00  0.10  0.02
## V5      0.61 -0.12  0.44  0.08  0.69  0.74  0.74 -0.41 -0.31  0.05 -0.09 -0.09
## V6      0.52 -0.13  0.56 -0.08  0.45  0.44  0.37 -0.25 -0.05 -0.07  0.04  0.32
## V7      0.72 -0.02  0.55  0.15  0.75  0.76  0.76 -0.38 -0.37  0.10 -0.15 -0.04
## V8     -0.20  0.27 -0.24  0.14 -0.22 -0.26 -0.22  0.23 -0.11  0.06 -0.11 -0.36
## V9      0.77 -0.04  0.70  0.09  0.74  0.76  0.72 -0.36 -0.37  0.15 -0.22 -0.06
## V10     0.88 -0.05  0.77  0.13  0.86  0.86  0.80 -0.44 -0.38  0.09 -0.21 -0.04
## V11    -0.91  0.08 -0.90 -0.03 -0.81 -0.78 -0.68  0.42  0.19  0.15  0.02 -0.05
## V12     1.00 -0.09  0.89  0.04  0.90  0.88  0.80 -0.48 -0.21 -0.10 -0.09  0.02
## V13    -0.09  1.00 -0.02 -0.02 -0.11 -0.14 -0.14  0.19 -0.07  0.06 -0.01 -0.10
## V14     0.89 -0.02  1.00 -0.03  0.78  0.74  0.63 -0.38 -0.13 -0.23  0.07  0.02
## V15     0.04 -0.02 -0.03  1.00 -0.02  0.00  0.00  0.22 -0.14  0.16 -0.11 -0.06
## V16     0.90 -0.11  0.78 -0.02  1.00  0.95  0.91 -0.64 -0.19 -0.15 -0.11  0.03
## V17     0.88 -0.14  0.74  0.00  0.95  1.00  0.94 -0.61 -0.21 -0.11 -0.12  0.03
## V18     0.80 -0.14  0.63  0.00  0.91  0.94  1.00 -0.57 -0.18 -0.08 -0.12  0.04
## V19    -0.48  0.19 -0.38  0.22 -0.64 -0.61 -0.57  1.00  0.04  0.12  0.13 -0.03
## V20    -0.21 -0.07 -0.13 -0.14 -0.19 -0.21 -0.18  0.04  1.00 -0.49  0.38  0.21
## V21    -0.10  0.06 -0.23  0.16 -0.15 -0.11 -0.08  0.12 -0.49  1.00 -0.70  0.05
## V22    -0.09 -0.01  0.07 -0.11 -0.11 -0.12 -0.12  0.13  0.38 -0.70  1.00  0.08
## V23     0.02 -0.10  0.02 -0.06  0.03  0.03  0.04 -0.03  0.21  0.05  0.08  1.00
## V24     0.05  0.09  0.14  0.00 -0.04 -0.07 -0.11  0.09 -0.07 -0.03  0.05 -0.04
## V25     0.04  0.04 -0.05 -0.03  0.08  0.04  0.07 -0.17  0.04  0.13 -0.22  0.16
## V26    -0.15  0.03 -0.21  0.06 -0.09 -0.07 -0.04  0.07  0.02  0.03  0.03  0.02
## V27    -0.09  0.04 -0.23  0.15 -0.08 -0.06 -0.01  0.07 -0.08  0.37 -0.34 -0.20
## V28    -0.04 -0.13 -0.15  0.12 -0.01  0.03  0.08 -0.03  0.05  0.27 -0.23  0.03
## Amount  0.02 -0.02  0.05  0.04  0.00 -0.02  0.00 -0.03  0.27  0.05 -0.04 -0.06
## Class  -0.69 -0.06 -0.76  0.00 -0.59 -0.55 -0.46  0.27  0.16  0.12 -0.01 -0.03
##          V24   V25   V26   V27   V28 Amount Class
## Time   -0.03 -0.17 -0.05 -0.16  0.02   0.02 -0.17
## V1     -0.04 -0.07  0.03  0.18  0.15  -0.05 -0.44
## V2      0.01  0.12  0.03 -0.10  0.01  -0.27  0.49
## V3      0.03 -0.08 -0.04  0.07  0.09  -0.01 -0.57
## V4     -0.11 -0.03  0.15  0.01 -0.04   0.02  0.71
## V5     -0.11 -0.09  0.03  0.15  0.15  -0.12 -0.36
## V6     -0.02 -0.10 -0.09 -0.20 -0.08   0.21 -0.44
## V7     -0.04  0.07  0.00  0.17  0.13   0.14 -0.48
## V8      0.06  0.23  0.04  0.31  0.02   0.02  0.09
## V9      0.04 -0.01 -0.14  0.06  0.11   0.02 -0.57
## V10     0.02  0.03 -0.06  0.07  0.10   0.00 -0.63
## V11    -0.12  0.00  0.17  0.21  0.05  -0.02  0.70
## V12     0.05  0.04 -0.15 -0.09 -0.04   0.02 -0.69
## V13     0.09  0.04  0.03  0.04 -0.13  -0.02 -0.06
## V14     0.14 -0.05 -0.21 -0.23 -0.15   0.05 -0.76
## V15     0.00 -0.03  0.06  0.15  0.12   0.04  0.00
## V16    -0.04  0.08 -0.09 -0.08 -0.01   0.00 -0.59
## V17    -0.07  0.04 -0.07 -0.06  0.03  -0.02 -0.55
## V18    -0.11  0.07 -0.04 -0.01  0.08   0.00 -0.46
## V19     0.09 -0.17  0.07  0.07 -0.03  -0.03  0.27
## V20    -0.07  0.04  0.02 -0.08  0.05   0.27  0.16
## V21    -0.03  0.13  0.03  0.37  0.27   0.05  0.12
## V22     0.05 -0.22  0.03 -0.34 -0.23  -0.04 -0.01
## V23    -0.04  0.16  0.02 -0.20  0.03  -0.06 -0.03
## V24     1.00 -0.05 -0.12 -0.15 -0.07  -0.05 -0.13
## V25    -0.05  1.00  0.06  0.17  0.15  -0.08  0.02
## V26    -0.12  0.06  1.00  0.17  0.03  -0.06  0.06
## V27    -0.15  0.17  0.17  1.00  0.24   0.03  0.12
## V28    -0.07  0.15  0.03  0.24  1.00  -0.06  0.09
## Amount -0.05 -0.08 -0.06  0.03 -0.06   1.00  0.04
## Class  -0.13  0.02  0.06  0.12  0.09   0.04  1.00
M <-cor(traindown)

Correlation Plot

options(repr.plot.width=35, repr.plot.height=35)
corrplot(M, type="upper", order = "hclust", col=brewer.pal(n=8, name="PiYG"))

V17, V14, V12 and V10 are negatively correlated with Class. Notice how the lower these values are, the more likely the end result will be a fraud transaction.

V2, V4, V11, and V19 are positively correlated with Class. Notice how the higher these values are, the more likely the end result will be a fraud transaction.

Leaving Class as numeric for Clustering

K-means Clustering

km.res <- kmeans(traindown, 3, nstart = 10)

Visualize kmeans clustering

Use repel = TRUE to avoid over-plotting of point labels (optional)

fviz_cluster(km.res, traindown[, -ncol(traindown)], ellipse.type = "norm")   # plot against all variables except the Class column

Although the under-sampled subset is fairly small, the K-means algorithm is able to detect clusters reasonably well.

This gives us an indication that further predictive models should perform well at separating fraud cases from non-fraud cases.

We will change the Class variable back to a factor because the algorithms we wish to use (logistic regression, decision trees) require the target variable to be a factor.

Change Class back to factor

traindown$Class <- as.factor(traindown$Class)

In many modeling settings, such as psychological studies, we would want to detect the outliers and remove those cases. However, in several verticals, such as sensor fault detection, fraud detection, and disaster risk warning systems, it is the outliers or anomalies that are of most interest, as they often indicate the unusual situation we are trying to detect. Again, this also informs the choice of algorithm. Nevertheless, we will run an outlier detection routine to view the outliers in our dataset.

create detect outlier function

# create detect outlier function
detect_outlier <- function(x) {
  
  # calculate first quantile
  Quantile1 <- quantile(x, probs=.25)
  
  # calculate third quantile
  Quantile3 <- quantile(x, probs=.75)
  
  # calculate interquartile range
  IQR = Quantile3 - Quantile1
  
  # return true or false
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}

create remove outlier function

remove_outlier <- function(traindown, columns = names(traindown)) {
  
  # loop over the supplied columns
  for (col in columns) {
    
    # drop any observation flagged as an outlier in this column
    traindown <- traindown[!detect_outlier(traindown[[col]]), ]
  }
  
  # show the first rows of the data after outlier removal
  print("Remove outliers")
  head(traindown, 5)
}

remove_outlier(traindown,c('V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24','V25','V26','V27','V28','Time','Amount'))
## [1] "Remove outliers"
##          Time         V1         V2         V3         V4          V5
## 2   0.7938642  0.1317100  0.5258382 -0.4143913 -0.3997963  0.70217527
## 3   0.8847711  0.9309564 -0.3851299 -0.3655774  0.2385870 -0.45239911
## 5  -1.9241193 -0.2128161  0.5472595  0.7330370 -0.1148105  0.21113849
## 7  -0.5321088 -0.4544119  0.1833689  1.5995685  0.1161120  0.04634072
## 13  0.8342533  0.8386033  0.1204831 -0.5899095  2.6304709  0.45016039
##             V6          V7           V8         V9         V10        V11
## 2  -0.65996511  0.77468106 -0.128018995 -0.1784619 -0.86920371 -0.7342596
## 3  -0.22888926 -0.37423865  0.082488530  1.0390051  0.01055984  0.4231858
## 5  -0.01847267  0.36481386  0.260136357 -0.3889529 -0.32241282  1.1531301
## 7   0.10868678 -0.06680613  0.326499077  0.1426745 -0.62806958  0.2808093
## 13  0.31974095  0.32633809 -0.001117488 -1.1987267  1.38071689  0.5551154
##           V12        V13         V14        V15       V16          V17
## 2  -0.8597611 -0.8200004 -0.78783645  0.5238678 0.5647907  0.396683247
## 3   0.5088843 -1.0893732  0.45737714  0.0980693 0.5853090 -0.972171149
## 5  -0.1570425 -1.3288297  0.05654065  0.6804164 0.5123328 -0.004735056
## 7   0.5250424 -0.5926783 -0.21898132 -1.1436442 0.5856412 -0.952697112
## 13  0.8643520  0.4576969  0.35062848 -2.0314462 0.9367574 -1.073576837
##           V18         V19         V20        V21        V22          V23
## 2   0.5554989 -0.83316850 -0.19866615  0.2754933  0.7810111 -0.283545336
## 3   0.5714500  0.43755314 -0.16801094 -0.2228802 -0.8136636  0.462568554
## 5   0.2249984 -0.03490229  0.03757060 -0.2969483 -0.8785076 -0.009712052
## 7   0.8498477 -0.32515944 -0.09061588  0.1009944  0.2667148 -0.296062393
## 13 -0.4457044 -1.26103240  0.04252027 -0.1381251 -0.8001834  0.333833597
##            V24         V25        V26         V27         V28      Amount Class
## 2   0.89267505 -0.26688768  1.2014236 -0.08625353  0.04143969 -0.30964971     0
## 3  -0.68535469 -0.94366563 -1.3831463  0.01129131 -0.10418028 -0.02546624     0
## 5  -0.65790428 -0.50681543  0.2259221  0.61058710  0.23896781 -0.31436744     0
## 7  -0.01999072  0.06949997 -1.1400439  0.30674777  0.35328760 -0.31328796     0
## 13  1.18596016 -0.37551862 -0.6966841 -0.16897848 -0.07705673  0.18399313     0

6) Logistic Regression

Logistic Regression

We are going to fit a logistic regression model. How do we know which algorithms to try?

When selecting an algorithm or family of algorithms, you consider a few things:

The question that you are answering - in this case, whether a transaction is fraud or not fraud. This is a classification problem; classification problems deal with discrete outcomes (yes/no, binary, or multi-class).

We must also consider the type of data we have - in our case we have continuous (not mixed) predictor variables and no missing values. Often you will have missing data; sometimes you will choose to delete or impute those values, and in fraud work the missing values can themselves be a signal, but that is a discussion for another time. Some algorithms handle missing data better than others. Since we have none, we do not have to consider this here.

Logistic regression is conceptually similar to linear regression, but where linear regression estimates the value of the target variable directly, logistic regression estimates the odds of a certain event occurring.

Since we are predicting fraud, with logistic regression we are actually estimating the odds of fraud occurring.

Another algorithm we will try is the SVM, or Support Vector Machine. An SVM is a supervised machine learning algorithm used for complex classification problems. Its objective is to find a hyperplane in N-dimensional space (where N is the number of features) that distinctly separates the data points of the two classes. Many possible hyperplanes could separate them; our objective is to find the plane with the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.

Selecting vars that were correlated for new training set

subtraindown <-traindown %>% select(Class, V17, V14, V12, V10,V2, V4, V11, V19)

Logistic regression - including highly correlated vars

Training model

LR_Model <- glm(formula = Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data = subtraindown,family = "binomial")

Validate on trainData

Valid_trainData <- predict(LR_Model, newdata = subtraindown, type = "response")   # predicted probabilities
Valid_trainData <- ifelse(Valid_trainData > 0.5, 1, 0)                            # apply the 0.5 prediction threshold

Produce Confusion Matrix

confusion_Mat<- confusionMatrix(as.factor(subtraindown$Class),as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 388   6
##          1  30 364
##                                           
##                Accuracy : 0.9543          
##                  95% CI : (0.9373, 0.9678)
##     No Information Rate : 0.5305          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9086          
##                                           
##  Mcnemar's Test P-Value : 0.0001264       
##                                           
##             Sensitivity : 0.9282          
##             Specificity : 0.9838          
##          Pos Pred Value : 0.9848          
##          Neg Pred Value : 0.9239          
##              Prevalence : 0.5305          
##          Detection Rate : 0.4924          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9560          
##                                           
##        'Positive' Class : 0               
## 

Overall, this is a very high-performing model. Looking at the confusion matrix, the room for improvement is that some fraud cases are still misclassified as non-fraud.

Validate on the test data

testData_Class_predicted <- predict(LR_Model, newdata = test, type = "response")  
testData_Class_predicted  <- ifelse(testData_Class_predicted  > 0.5, 1, 0)  # set binary prediction threshold

conMat<- confusionMatrix(as.factor(test$Class),as.factor(testData_Class_predicted))

Regression_Acc_Test <-round(conMat$overall["Accuracy"]*100,2)
paste('Model Test Accuracy =', Regression_Acc_Test) 
## [1] "Model Test Accuracy = 96.77"

Produce Prediction on Test Data

Create ROC Curve

roc_object <- roc( test$Class, testData_Class_predicted)

Calculate Area Under the Curve

auc(roc_object)
## Area under the curve: 0.9278

The ROC curve helps us visualize the trade-off between sensitivity (the true positive rate) and specificity (one minus the false positive rate) across threshold values. A perfect classifier would have an ROC curve that passes through the top-left corner of the plot (100% sensitivity and 100% specificity). The area under the ROC curve (AUC) is a scalar value that summarizes the performance of the classifier: an AUC of 1.0 indicates a perfect classifier, while an AUC of 0.5 suggests the classifier is no better than random chance. Ours is approximately 0.93.

The other piece of “good news” is that, because our accuracy on the training data and on the test data are so close, over-fitting is unlikely.

SVM - or a Support Vector Machine

Fit the model using default parameters

SVM_model<- svm(Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data=subtraindown, kernel = 'radial', type="C-classification")
Valid_trainData <- predict(SVM_model, subtraindown) 

Produce Confusion Matrix

confusion_Mat<- confusionMatrix(as.factor(subtraindown$Class), as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 393   1
##          1  40 354
##                                           
##                Accuracy : 0.948           
##                  95% CI : (0.9301, 0.9624)
##     No Information Rate : 0.5495          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8959          
##                                           
##  Mcnemar's Test P-Value : 2.946e-09       
##                                           
##             Sensitivity : 0.9076          
##             Specificity : 0.9972          
##          Pos Pred Value : 0.9975          
##          Neg Pred Value : 0.8985          
##              Prevalence : 0.5495          
##          Detection Rate : 0.4987          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9524          
##                                           
##        'Positive' Class : 0               
## 

Output Accuracy

AVM_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100,4)
paste('Model Train Accuracy =', AVM_Acc_Train)
## [1] "Model Train Accuracy = 94.797"

Again we have a highly accurate model with very few misclassifications, and those that are misclassified are fraud cases labeled as non-fraud.

Predict on test data

Test_Fraud_predicted <- predict(SVM_model, test) #produce confusion matrix 

conMat<- confusionMatrix(as.factor(test$Class), as.factor(Test_Fraud_predicted))

Output Accuracy

AVM_Acc_Test <- round(conMat$overall["Accuracy"]*100,4)
paste('Model Test Accuracy =', AVM_Acc_Test) 
## [1] "Model Test Accuracy = 98.6148"

Collecting train and test accuracy

SVM_Acc <- c(AVM_Acc_Train, AVM_Acc_Test)

Both Train and Test Accuracy for SVM

SVM_Acc
## Accuracy Accuracy 
##  94.7970  98.6148
