##This is a demo in R of predicting credit card fraud in a dataset. We will work through the roadmap outlined below.
##Of note, we are using a publicly available, anonymized dataset, “creditcard.csv.”
#The dataset has already gone through PCA (Principal Component Analysis). I have already posted a tutorial on PCA, which you can follow; here we will discuss how it impacts the analysis. Additionally, the target (dependent) variable is labeled “Class,” and it is the fraud indicator, already coded in the set as ‘0’ or ‘1.’ We will discuss how this would be deployed in an enterprise environment.
##Roadmap
##1) EDA - Exploratory Data Analysis of our data
##2) Preprocessing/Cleaning the Data
##3) Splitting the data into train/test
##4) Sampling/Under-Sampling
##5) Correlation Matrices
##6) Contrasting Decision Trees with Logistic Regression
##References:
##Analytics Vidhya (2024). “Practical Guide for Dealing with Imbalanced Classification Problems in R.” Analytics Vidhya.
##Daras, D. (2024). “PCA R FakeBills.” GitHub.
##Kuhn, M. (2023). “C5.0 Decision Trees and Rule-Based Models.” CRAN.R-project.org.
##Husejinovic, A. (2020). “Credit card fraud detection using naive Bayesian and C4.5 decision tree classifiers.” Periodicals of Engineering and Natural Sciences.
##Gandhi, R. (2018). “Support Vector Machine. Introduction to Machine Learning Algorithms.” SVM from Scratch.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: lattice
## corrplot 0.92 loaded
## Rborist 0.3-7
## Type RboristNews() to see new features/changes/bug fixes.
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
## Loaded ROSE 0.0-4
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
##
## Attaching package: 'lares'
## The following object is masked from 'package:e1071':
##
## impute
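##The package startup messages above come from a setup chunk that is not echoed. Below is a minimal sketch of the library() calls those messages imply; the last two (RColorBrewer and vembedr) are assumptions inferred from brewer.pal() and embed_url() being called later.
library(dplyr)        # data wrangling; %>% and select() used below
library(caret)        # createDataPartition(), downSample(), confusionMatrix(); loads ggplot2 and lattice
library(corrplot)     # corrplot() correlation heatmap
library(Rborist)      # random forest engine (loaded, not used in this section)
library(xgboost)      # gradient boosting (loaded, not used in this section)
library(ROSE)         # sampling methods for imbalanced data
library(factoextra)   # fviz_cluster() for the k-means plot
library(pROC)         # roc() and auc()
library(e1071)        # svm()
library(lares)        # corr_cross()
library(RColorBrewer) # assumed: brewer.pal() palette used with corrplot
library(vembedr)      # assumed: embed_url() at the end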
df <- read.csv("/cloud/project/creditcard.csv", header=TRUE, stringsAsFactors=FALSE)
head(df,1)
## Time V1 V2 V3 V4 V5 V6 V7
## 1 0 -1.359807 -0.07278117 2.536347 1.378155 -0.3383208 0.4623878 0.2395986
## V8 V9 V10 V11 V12 V13 V14
## 1 0.0986979 0.363787 0.09079417 -0.5515995 -0.6178009 -0.9913898 -0.3111694
## V15 V16 V17 V18 V19 V20 V21
## 1 1.468177 -0.4704005 0.2079712 0.02579058 0.403993 0.2514121 -0.01830678
## V22 V23 V24 V25 V26 V27 V28
## 1 0.2778376 -0.1104739 0.06692807 0.1285394 -0.1891148 0.1335584 -0.02105305
## Amount Class
## 1 149.62 0
#1) EDA - Exploratory Data Analysis
##Because the variables are principal components produced by PCA, they are simply labeled V1, V2, V3, etc., and they are expressed on the same scale with means of “0.” Being principal components, they will also be largely uncorrelated with one another when we run our correlation. The only variables not on this scale are Amount and Time, which we will scale later.
##When we run a summary we see that the Amount variable has at least one, if not more, outliers, which could impact our analysis.
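##That summary is not echoed here; a quick sketch of the check, run on the raw (not yet scaled) Amount column:
summary(df$Amount)                                 # the maximum sits far above the 3rd quartile
boxplot(df$Amount, main = "Amount", col = "pink")  # outliers appear as points beyond the whiskers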
df$Class<-as.factor(df$Class)
prop.table(table(df$Class))
##
## 0 1
## 0.998272514 0.001727486
counts <- table(df$Class)
barplot(counts, main="Fraud Distribution",
xlab="Counts of Fraud", col="hotpink")
##We will have to address this. Because of this imbalance in the dataset, a model might show a high accuracy number, but we should not be fooled by that number, because the sensitivity to the rare fraud class would be low. Therefore, we have to address this imbalance in the dataset before going further.
hist(df$Amount, col = 'pink', border = "white")
hist(df$Time, col = 'purple', border = "white")
a <- colnames(df)[2:29]
par(mfrow = c(1, 4))
for (i in 1:length(a)){
sub = df[a[i]][,1]
hist(sub, main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
}
#2) Preprocessing/Cleaning the Data
##Now we are going to scale the variables Time and Amount so they are on the same scale as the other variables. We use the z-score formula (subtract the mean, divide by the standard deviation) to set the means to “0” and standardize the variables.
# applying scale function
df[1 : 30] <- as.data.frame(scale(df[1 : 30]))
# displaying result
head(df,5)
## Time V1 V2 V3 V4 V5
## 1 -1.996580 -0.6942411 -0.04407485 1.6727706 0.9733638 -0.245116153
## 2 -1.996580 0.6084953 0.16117564 0.1097969 0.3165224 0.043483276
## 3 -1.996558 -0.6934992 -0.81157640 1.1694664 0.2682308 -0.364571146
## 4 -1.996558 -0.4933240 -0.11216923 1.1825144 -0.6097256 -0.007468867
## 5 -1.996537 -0.5913287 0.53154012 1.0214099 0.2846549 -0.295014918
## V6 V7 V8 V9 V10 V11
## 1 0.34706734 0.1936786 0.08263713 0.3311272 0.08338540 -0.5404061
## 2 -0.06181986 -0.0637001 0.07125336 -0.2324938 -0.15334936 1.5800001
## 3 1.35145121 0.6397745 0.20737237 -1.3786729 0.19069928 0.6118286
## 4 0.93614819 0.1920703 0.31601704 -1.2625010 -0.05046786 -0.2218912
## 5 0.07199846 0.4793014 -0.22650983 0.7443250 0.69162382 -0.8061452
## V12 V13 V14 V15 V16 V17 V18
## 1 -0.6182946 -0.9960972 -0.3246096 1.6040110 -0.5368319 0.2448630 0.03076988
## 2 1.0660867 0.4914173 -0.1499822 0.6943592 0.5294328 -0.1351697 -0.21876220
## 3 0.0661365 0.7206986 -0.1731136 2.5629017 -3.2982296 1.3068656 -0.14478974
## 4 0.1783707 0.5101678 -0.3003600 -0.6898362 -1.2092939 -0.8054432 2.34530040
## 5 0.5386257 1.3522420 -1.1680315 0.1913231 -0.5152042 -0.2790803 -0.04556892
## V19 V20 V21 V22 V23 V24
## 1 0.4962812 0.32611744 -0.02492332 0.382853766 -0.1769110 0.1105067
## 2 -0.1790857 -0.08961071 -0.30737626 -0.880075209 0.1622009 -0.5611296
## 3 -2.7785560 0.68097378 0.33763110 1.063356404 1.4563172 -1.1380901
## 4 -1.5142023 -0.26985475 -0.14744304 0.007266895 -0.3047760 -1.9410237
## 5 0.9870356 0.52993786 -0.01283920 1.100009340 -0.2201230 0.2332497
## V25 V26 V27 V28 Amount Class
## 1 0.2465850 -0.3921697 0.33089104 -0.06378104 0.24496383 0
## 2 0.3206933 0.2610690 -0.02225564 0.04460744 -0.34247394 0
## 3 -0.6285356 -0.2884462 -0.13713661 -0.18102051 1.16068389 0
## 4 1.2419015 -0.4602165 0.15539593 0.18618826 0.14053401 0
## 5 -0.3952009 1.0416095 0.54361884 0.65181477 -0.07340321 0
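##As a quick sanity check on the z-score scaling (z = (x - mean) / sd), the 30 scaled columns should now each have mean 0 and standard deviation 1:
round(colMeans(df[1:30]), 3)      # should all be approximately 0
round(apply(df[1:30], 2, sd), 3)  # should all be 1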
#3) Splitting Data into Train/Test Samples
##Why are we splitting our data? And why, in our next step, do we adjust the training data so that the “Class” variable, which represents fraud, is balanced in the training set, while the testing set is left unaltered? One answer: “Bayesian statistics and machine learning.”
##In Fisherian statistics, we set up a null hypothesis to be disproved according to a p-value. Fisher's framework assumes lab conditions where variables are controlled and manipulated.
##In Bayesian machine learning, by contrast, the likelihood is something that can be estimated from the training data. In machine learning, the model is essentially learning the parameters of the target from the training data; the idea is that models are built upon parameters estimated from real-world data.
##Training a machine learning model is an iterative process that updates the model’s parameters to maximize the probability of seeing the training data given the current parameters. The test set is left with its natural class balance so that it reflects the conditions the model will face at prediction time.
set.seed(1234)
index<-createDataPartition(df$Class,p=0.8,list=FALSE)
train<-df[index,]
test<-df [-index,]
# Get the count and proportion of each classes
prop.table(table(train$Class))
##
## 0 1
## 0.998270762 0.001729238
table(train$Class)
##
## 0 1
## 227452 394
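##The held-out test set is left untouched, so it should keep the original class imbalance of roughly 0.17% fraud; a quick check:
prop.table(table(test$Class))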
#4) Sampling/Under-Sampling/SMOTE
##There are a few ways to balance the dataset on the Class variable.
##Upsampling - this method increases the size of the minority class by sampling with replacement so that the classes have the same size. The advantage of this method is that it leads to no information loss. The disadvantage is that, since oversampling simply adds replicated observations from the original data set, it often leads to overfitting.
##Downsampling - in contrast to the above method, this one decreases the size of the majority class to be the same as (or closer to) the minority class size by taking a random sample of it. The problem with this method is that it decreases the size of the overall training set, which can be a problem for some algorithms.
##Hybrid Methods - these include ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique), which down-sample the majority class and create new artificial points in the minority class; a sketch of the hybrid approach follows below.
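##As an aside, the ROSE package loaded earlier can produce such a hybrid, synthetically balanced sample in one call; a sketch only, not run here:
train_rose <- ROSE(Class ~ ., data = train, seed = 111)$data  # synthetic, roughly 50/50 sample
table(train_rose$Class)                                       # check the new class balance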
##We will be using downsampling. Generally, downsampling is recommended over oversampling here because every training example remains a real observation, which gives a more accurate representation of the cases within the dataset you are trying to detect.
ggplot(train,aes(x=Amount,y=Time,group=Class))+
geom_point(aes(color=Class))+
scale_color_brewer(palette="Accent")
##Now we are going to under-sample to balance our training set
set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
y=train$Class)
table(traindown$Class)
##
## 0 1
## 394 394
##Now we will check the distribution again
ggplot(traindown,aes(x=Amount,y=Time,group=Class))+
geom_point(aes(color=Class))+
scale_color_brewer(palette="Accent")
##As we can see, in the under-sampled training set the two classes appear much more evenly distributed.
#5) Correlation Matrices
##We are going to look at the correlations of the variables, particularly with respect to fraud, in the training sample that has been balanced (under-sampled). Otherwise the correlations would be distorted by the imbalance in the classes.
##Evaluating a Correlation Matrix: in reading a correlation matrix,
##below +/- .29 is considered a LOW correlation,
##between +/- .30 and .49 is a MEDIUM correlation, and
##between +/- .50 and 1.00 is a HIGH correlation.
##Pearson's Correlation Test: we build a correlation matrix for all quantitative variables, dropping variables that are not quantitative (i.e., categorical). We also save the correlation matrix itself as an object for a visualization.
str(traindown)
## 'data.frame': 788 obs. of 31 variables:
## $ Time : num -0.22 0.794 0.885 0.548 -1.924 ...
## $ V1 : num -0.745 0.132 0.931 -0.782 -0.213 ...
## $ V2 : num -0.3798 0.5258 -0.3851 -0.0166 0.5473 ...
## $ V3 : num 1.482 -0.414 -0.366 1.463 0.733 ...
## $ V4 : num -0.581 -0.4 0.239 -1.538 -0.115 ...
## $ V5 : num 0.0479 0.7022 -0.4524 -0.6077 0.2111 ...
## $ V6 : num 0.3126 -0.66 -0.2289 0.0536 -0.0185 ...
## $ V7 : num -0.395 0.775 -0.374 -0.144 0.365 ...
## $ V8 : num 0.4149 -0.128 0.0825 0.0784 0.2601 ...
## $ V9 : num 0.82 -0.178 1.039 -0.217 -0.389 ...
## $ V10 : num -1.0365 -0.8692 0.0106 0.1761 -0.3224 ...
## $ V11 : num 1.558 -0.734 0.423 0.464 1.153 ...
## $ V12 : num 0.907 -0.8598 0.5089 0.0321 -0.157 ...
## $ V13 : num -0.667 -0.82 -1.089 -0.264 -1.329 ...
## $ V14 : num 0.1267 -0.7878 0.4574 -1.1714 0.0565 ...
## $ V15 : num 1.7574 0.5239 0.0981 -2.3711 0.6804 ...
## $ V16 : num -1.465 0.565 0.585 1.255 0.512 ...
## $ V17 : num 0.68909 0.39668 -0.97217 0.01392 -0.00474 ...
## $ V18 : num 0.195 0.555 0.571 -1.308 0.225 ...
## $ V19 : num 1.7199 -0.8332 0.4376 0.8007 -0.0349 ...
## $ V20 : num -0.172 -0.1987 -0.168 -0.108 0.0376 ...
## $ V21 : num 0.1404 0.2755 -0.2229 -0.0167 -0.2969 ...
## $ V22 : num 1.028 0.781 -0.814 0.138 -0.879 ...
## $ V23 : num 0.83108 -0.28355 0.46257 -0.52451 -0.00971 ...
## $ V24 : num -0.467 0.893 -0.685 0.134 -0.658 ...
## $ V25 : num 0.524 -0.267 -0.944 0.704 -0.507 ...
## $ V26 : num -0.916 1.201 -1.383 -0.947 0.226 ...
## $ V27 : num 0.419 -0.0863 0.0113 -2.3522 0.6106 ...
## $ V28 : num -0.0589 0.0414 -0.1042 -1.2205 0.239 ...
## $ Amount: num -0.3492 -0.3096 -0.0255 -0.2293 -0.3144 ...
## $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#convert column 'Class' from factor to numeric
traindown$Class <- as.numeric(as.character(traindown$Class))
class(traindown$Class)
## [1] "numeric"
cormat <- round(cor(traindown),
digits = 2 # rounded to 2 decimals
)
print(cormat)
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## Time 1.00 0.25 -0.25 0.16 -0.22 0.30 0.10 0.23 -0.15 0.19 0.23 -0.33
## V1 0.25 1.00 -0.80 0.87 -0.62 0.86 0.30 0.87 -0.11 0.66 0.74 -0.52
## V2 -0.25 -0.80 1.00 -0.85 0.67 -0.77 -0.33 -0.84 0.02 -0.71 -0.77 0.62
## V3 0.16 0.87 -0.85 1.00 -0.77 0.84 0.46 0.88 -0.21 0.76 0.85 -0.71
## V4 -0.22 -0.62 0.67 -0.77 1.00 -0.59 -0.45 -0.71 0.14 -0.80 -0.80 0.80
## V5 0.30 0.86 -0.77 0.84 -0.59 1.00 0.30 0.83 -0.24 0.66 0.75 -0.53
## V6 0.10 0.30 -0.33 0.46 -0.45 0.30 1.00 0.32 -0.54 0.40 0.44 -0.52
## V7 0.23 0.87 -0.84 0.88 -0.71 0.83 0.32 1.00 0.05 0.76 0.86 -0.64
## V8 -0.15 -0.11 0.02 -0.21 0.14 -0.24 -0.54 0.05 1.00 -0.10 -0.08 0.22
## V9 0.19 0.66 -0.71 0.76 -0.80 0.66 0.40 0.76 -0.10 1.00 0.86 -0.71
## V10 0.23 0.74 -0.77 0.85 -0.80 0.75 0.44 0.86 -0.08 0.86 1.00 -0.81
## V11 -0.33 -0.52 0.62 -0.71 0.80 -0.53 -0.52 -0.64 0.22 -0.71 -0.81 1.00
## V12 0.28 0.59 -0.67 0.76 -0.84 0.61 0.52 0.72 -0.20 0.77 0.88 -0.91
## V13 -0.11 -0.06 0.06 -0.07 0.04 -0.12 -0.13 -0.02 0.27 -0.04 -0.05 0.08
## V14 0.20 0.44 -0.58 0.66 -0.81 0.44 0.56 0.55 -0.24 0.70 0.77 -0.90
## V15 -0.13 0.10 -0.14 0.12 -0.10 0.08 -0.08 0.15 0.14 0.09 0.13 -0.03
## V16 0.26 0.63 -0.63 0.72 -0.73 0.69 0.45 0.75 -0.22 0.74 0.86 -0.81
## V17 0.27 0.67 -0.64 0.73 -0.71 0.74 0.44 0.76 -0.26 0.76 0.86 -0.78
## V18 0.29 0.67 -0.62 0.69 -0.64 0.74 0.37 0.76 -0.22 0.72 0.80 -0.68
## V19 -0.08 -0.32 0.27 -0.33 0.33 -0.41 -0.25 -0.38 0.23 -0.36 -0.44 0.42
## V20 -0.04 -0.32 0.26 -0.35 0.28 -0.31 -0.05 -0.37 -0.11 -0.37 -0.38 0.19
## V21 -0.07 0.04 -0.01 0.04 0.00 0.05 -0.07 0.10 0.06 0.15 0.09 0.15
## V22 0.15 -0.04 0.02 -0.06 0.10 -0.09 0.04 -0.15 -0.11 -0.22 -0.21 0.02
## V23 0.07 -0.04 0.10 -0.03 0.02 -0.09 0.32 -0.04 -0.36 -0.06 -0.04 -0.05
## V24 -0.03 -0.04 0.01 0.03 -0.11 -0.11 -0.02 -0.04 0.06 0.04 0.02 -0.12
## V25 -0.17 -0.07 0.12 -0.08 -0.03 -0.09 -0.10 0.07 0.23 -0.01 0.03 0.00
## V26 -0.05 0.03 0.03 -0.04 0.15 0.03 -0.09 0.00 0.04 -0.14 -0.06 0.17
## V27 -0.16 0.18 -0.10 0.07 0.01 0.15 -0.20 0.17 0.31 0.06 0.07 0.21
## V28 0.02 0.15 0.01 0.09 -0.04 0.15 -0.08 0.13 0.02 0.11 0.10 0.05
## Amount 0.02 -0.05 -0.27 -0.01 0.02 -0.12 0.21 0.14 0.02 0.02 0.00 -0.02
## Class -0.17 -0.44 0.49 -0.57 0.71 -0.36 -0.44 -0.48 0.09 -0.57 -0.63 0.70
## V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
## Time 0.28 -0.11 0.20 -0.13 0.26 0.27 0.29 -0.08 -0.04 -0.07 0.15 0.07
## V1 0.59 -0.06 0.44 0.10 0.63 0.67 0.67 -0.32 -0.32 0.04 -0.04 -0.04
## V2 -0.67 0.06 -0.58 -0.14 -0.63 -0.64 -0.62 0.27 0.26 -0.01 0.02 0.10
## V3 0.76 -0.07 0.66 0.12 0.72 0.73 0.69 -0.33 -0.35 0.04 -0.06 -0.03
## V4 -0.84 0.04 -0.81 -0.10 -0.73 -0.71 -0.64 0.33 0.28 0.00 0.10 0.02
## V5 0.61 -0.12 0.44 0.08 0.69 0.74 0.74 -0.41 -0.31 0.05 -0.09 -0.09
## V6 0.52 -0.13 0.56 -0.08 0.45 0.44 0.37 -0.25 -0.05 -0.07 0.04 0.32
## V7 0.72 -0.02 0.55 0.15 0.75 0.76 0.76 -0.38 -0.37 0.10 -0.15 -0.04
## V8 -0.20 0.27 -0.24 0.14 -0.22 -0.26 -0.22 0.23 -0.11 0.06 -0.11 -0.36
## V9 0.77 -0.04 0.70 0.09 0.74 0.76 0.72 -0.36 -0.37 0.15 -0.22 -0.06
## V10 0.88 -0.05 0.77 0.13 0.86 0.86 0.80 -0.44 -0.38 0.09 -0.21 -0.04
## V11 -0.91 0.08 -0.90 -0.03 -0.81 -0.78 -0.68 0.42 0.19 0.15 0.02 -0.05
## V12 1.00 -0.09 0.89 0.04 0.90 0.88 0.80 -0.48 -0.21 -0.10 -0.09 0.02
## V13 -0.09 1.00 -0.02 -0.02 -0.11 -0.14 -0.14 0.19 -0.07 0.06 -0.01 -0.10
## V14 0.89 -0.02 1.00 -0.03 0.78 0.74 0.63 -0.38 -0.13 -0.23 0.07 0.02
## V15 0.04 -0.02 -0.03 1.00 -0.02 0.00 0.00 0.22 -0.14 0.16 -0.11 -0.06
## V16 0.90 -0.11 0.78 -0.02 1.00 0.95 0.91 -0.64 -0.19 -0.15 -0.11 0.03
## V17 0.88 -0.14 0.74 0.00 0.95 1.00 0.94 -0.61 -0.21 -0.11 -0.12 0.03
## V18 0.80 -0.14 0.63 0.00 0.91 0.94 1.00 -0.57 -0.18 -0.08 -0.12 0.04
## V19 -0.48 0.19 -0.38 0.22 -0.64 -0.61 -0.57 1.00 0.04 0.12 0.13 -0.03
## V20 -0.21 -0.07 -0.13 -0.14 -0.19 -0.21 -0.18 0.04 1.00 -0.49 0.38 0.21
## V21 -0.10 0.06 -0.23 0.16 -0.15 -0.11 -0.08 0.12 -0.49 1.00 -0.70 0.05
## V22 -0.09 -0.01 0.07 -0.11 -0.11 -0.12 -0.12 0.13 0.38 -0.70 1.00 0.08
## V23 0.02 -0.10 0.02 -0.06 0.03 0.03 0.04 -0.03 0.21 0.05 0.08 1.00
## V24 0.05 0.09 0.14 0.00 -0.04 -0.07 -0.11 0.09 -0.07 -0.03 0.05 -0.04
## V25 0.04 0.04 -0.05 -0.03 0.08 0.04 0.07 -0.17 0.04 0.13 -0.22 0.16
## V26 -0.15 0.03 -0.21 0.06 -0.09 -0.07 -0.04 0.07 0.02 0.03 0.03 0.02
## V27 -0.09 0.04 -0.23 0.15 -0.08 -0.06 -0.01 0.07 -0.08 0.37 -0.34 -0.20
## V28 -0.04 -0.13 -0.15 0.12 -0.01 0.03 0.08 -0.03 0.05 0.27 -0.23 0.03
## Amount 0.02 -0.02 0.05 0.04 0.00 -0.02 0.00 -0.03 0.27 0.05 -0.04 -0.06
## Class -0.69 -0.06 -0.76 0.00 -0.59 -0.55 -0.46 0.27 0.16 0.12 -0.01 -0.03
## V24 V25 V26 V27 V28 Amount Class
## Time -0.03 -0.17 -0.05 -0.16 0.02 0.02 -0.17
## V1 -0.04 -0.07 0.03 0.18 0.15 -0.05 -0.44
## V2 0.01 0.12 0.03 -0.10 0.01 -0.27 0.49
## V3 0.03 -0.08 -0.04 0.07 0.09 -0.01 -0.57
## V4 -0.11 -0.03 0.15 0.01 -0.04 0.02 0.71
## V5 -0.11 -0.09 0.03 0.15 0.15 -0.12 -0.36
## V6 -0.02 -0.10 -0.09 -0.20 -0.08 0.21 -0.44
## V7 -0.04 0.07 0.00 0.17 0.13 0.14 -0.48
## V8 0.06 0.23 0.04 0.31 0.02 0.02 0.09
## V9 0.04 -0.01 -0.14 0.06 0.11 0.02 -0.57
## V10 0.02 0.03 -0.06 0.07 0.10 0.00 -0.63
## V11 -0.12 0.00 0.17 0.21 0.05 -0.02 0.70
## V12 0.05 0.04 -0.15 -0.09 -0.04 0.02 -0.69
## V13 0.09 0.04 0.03 0.04 -0.13 -0.02 -0.06
## V14 0.14 -0.05 -0.21 -0.23 -0.15 0.05 -0.76
## V15 0.00 -0.03 0.06 0.15 0.12 0.04 0.00
## V16 -0.04 0.08 -0.09 -0.08 -0.01 0.00 -0.59
## V17 -0.07 0.04 -0.07 -0.06 0.03 -0.02 -0.55
## V18 -0.11 0.07 -0.04 -0.01 0.08 0.00 -0.46
## V19 0.09 -0.17 0.07 0.07 -0.03 -0.03 0.27
## V20 -0.07 0.04 0.02 -0.08 0.05 0.27 0.16
## V21 -0.03 0.13 0.03 0.37 0.27 0.05 0.12
## V22 0.05 -0.22 0.03 -0.34 -0.23 -0.04 -0.01
## V23 -0.04 0.16 0.02 -0.20 0.03 -0.06 -0.03
## V24 1.00 -0.05 -0.12 -0.15 -0.07 -0.05 -0.13
## V25 -0.05 1.00 0.06 0.17 0.15 -0.08 0.02
## V26 -0.12 0.06 1.00 0.17 0.03 -0.06 0.06
## V27 -0.15 0.17 0.17 1.00 0.24 0.03 0.12
## V28 -0.07 0.15 0.03 0.24 1.00 -0.06 0.09
## Amount -0.05 -0.08 -0.06 0.03 -0.06 1.00 0.04
## Class -0.13 0.02 0.06 0.12 0.09 0.04 1.00
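##To pull out and rank just the correlations with Class, which is what the interpretation below focuses on, a quick sketch:
round(sort(cormat["Class", ]), 2)  # orders every variable by its correlation with the fraud label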
M <-cor(traindown)
options(repr.plot.width=35, repr.plot.height=35)
corrplot(M, type="upper", order = "hclust", col=brewer.pal(n=8, name="PiYG"))
corr_cross(traindown, # name of dataset
max_pvalue = 0.05, # display only significant correlations (at 5% level)
top = 10 # display top 10 couples of variables (by correlation coefficient)
)
## Returning only the top 10. You may override with the 'top' argument
## Warning in .font_global(font, quiet = FALSE, ...): Font(s) 'Arial Narrow' not
## installed, with other name, or can't be found
##V17, V14, V12 and V10 are negatively correlated with Class: the lower these values are, the more likely the transaction is fraudulent.
##V2, V4, V11 and V19 are positively correlated with Class: the higher these values are, the more likely the transaction is fraudulent.
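##To put a formal Pearson test behind one of these relationships, for example V14 (the strongest negative correlation with Class in the matrix above), a quick sketch:
cor.test(traindown$V14, traindown$Class)  # Class is still numeric at this point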
##Leaving Class as numeric for Clustering
# K-means clustering
# +++++++++++++++++++++
km.res <- kmeans(traindown, 3, nstart = 10)
# Visualize kmeans clustering
# use repel = TRUE to avoid overplotting
fviz_cluster(km.res, traindown, ellipse.type = "norm")
##Although the subsample is pretty small, the K-means algorithm is able to detect the clusters pretty accurately.
##This gives us an indication that further predictive models will perform pretty well in separating fraud cases from non-fraud cases.
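##The number of clusters (3) was chosen by hand above; factoextra also offers fviz_nbclust() to help pick k, a sketch (not run here):
fviz_nbclust(traindown, kmeans, method = "wss")  # "elbow" plot of within-cluster sum of squares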
##We will be changing Class variable back to factor because the algorithms we wish to use need to have a target variable that is a factor variable (Logistic, Decision Trees)
traindown$Class <- as.factor(traindown$Class)
##In many modeling settings, such as psychological studies, we would want to detect the outliers and remove those cases. However, in several verticals, such as sensor fault detection, fraud detection, and disaster risk warning systems, it is the outliers or anomalies that are of most interest, as they often indicate the unusual situation we are trying to detect. This also informs the algorithm choice. Still, we will run an outlier detection routine to view the outliers in our dataset.
# create detect outlier function
detect_outlier <- function(x) {
# calculate first quartile
Quantile1 <- quantile(x, probs = .25)
# calculate third quartile
Quantile3 <- quantile(x, probs = .75)
# calculate interquartile range
iqr <- Quantile3 - Quantile1
# return TRUE for values more than 1.5 * IQR beyond the quartiles
x > Quantile3 + (iqr * 1.5) | x < Quantile1 - (iqr * 1.5)
}
# create remove outlier function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
# loop over the requested columns
for (col in columns) {
# drop rows flagged as outliers in this column
dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
}
# report and show the first rows of the cleaned data frame
print("Remove outliers")
head(dataframe, 5)
}
remove_outlier(traindown,c('V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24','V25','V26','V27','V28','Time','Amount'))
## [1] "Remove outliers"
## Time V1 V2 V3 V4 V5
## 2 0.7938642 0.1317100 0.5258382 -0.4143913 -0.3997963 0.70217527
## 3 0.8847711 0.9309564 -0.3851299 -0.3655774 0.2385870 -0.45239911
## 5 -1.9241193 -0.2128161 0.5472595 0.7330370 -0.1148105 0.21113849
## 7 -0.5321088 -0.4544119 0.1833689 1.5995685 0.1161120 0.04634072
## 13 0.8342533 0.8386033 0.1204831 -0.5899095 2.6304709 0.45016039
## V6 V7 V8 V9 V10 V11
## 2 -0.65996511 0.77468106 -0.128018995 -0.1784619 -0.86920371 -0.7342596
## 3 -0.22888926 -0.37423865 0.082488530 1.0390051 0.01055984 0.4231858
## 5 -0.01847267 0.36481386 0.260136357 -0.3889529 -0.32241282 1.1531301
## 7 0.10868678 -0.06680613 0.326499077 0.1426745 -0.62806958 0.2808093
## 13 0.31974095 0.32633809 -0.001117488 -1.1987267 1.38071689 0.5551154
## V12 V13 V14 V15 V16 V17
## 2 -0.8597611 -0.8200004 -0.78783645 0.5238678 0.5647907 0.396683247
## 3 0.5088843 -1.0893732 0.45737714 0.0980693 0.5853090 -0.972171149
## 5 -0.1570425 -1.3288297 0.05654065 0.6804164 0.5123328 -0.004735056
## 7 0.5250424 -0.5926783 -0.21898132 -1.1436442 0.5856412 -0.952697112
## 13 0.8643520 0.4576969 0.35062848 -2.0314462 0.9367574 -1.073576837
## V18 V19 V20 V21 V22 V23
## 2 0.5554989 -0.83316850 -0.19866615 0.2754933 0.7810111 -0.283545336
## 3 0.5714500 0.43755314 -0.16801094 -0.2228802 -0.8136636 0.462568554
## 5 0.2249984 -0.03490229 0.03757060 -0.2969483 -0.8785076 -0.009712052
## 7 0.8498477 -0.32515944 -0.09061588 0.1009944 0.2667148 -0.296062393
## 13 -0.4457044 -1.26103240 0.04252027 -0.1381251 -0.8001834 0.333833597
## V24 V25 V26 V27 V28 Amount Class
## 2 0.89267505 -0.26688768 1.2014236 -0.08625353 0.04143969 -0.30964971 0
## 3 -0.68535469 -0.94366563 -1.3831463 0.01129131 -0.10418028 -0.02546624 0
## 5 -0.65790428 -0.50681543 0.2259221 0.61058710 0.23896781 -0.31436744 0
## 7 -0.01999072 0.06949997 -1.1400439 0.30674777 0.35328760 -0.31328796 0
## 13 1.18596016 -0.37551862 -0.6966841 -0.16897848 -0.07705673 0.18399313 0
#6) Logistic Regression
##We are going to fit a Logistic Regression model. How do we know which algorithms to try?
##When selecting an algorithm or family of algorithms, you consider a few things:
##The question that you are answering - in this case, whether a transaction is fraud or not fraud. This is a classification problem; classification problems deal with discrete outcomes (yes/no binary, or multi-class).
##We also must consider the type of data we have - in our case we have continuous (not mixed) predictor variables and no missing values. Often you might have missing data; sometimes you will choose to delete or impute those values, and in fraud those missing values can themselves be a signal, but that is for another discussion. Some algorithms handle missing data better than others. However, we do not have any, so we do not have to consider this.
##Logistic regression is conceptually similar to linear regression, but where linear regression estimates the value of the target variable directly, logistic regression estimates the odds of a certain event occurring.
##Since we are predicting fraud, with logistic regression we are actually estimating the odds of fraud occurring.
subtraindown <-traindown %>% select(Class, V17, V14, V12, V10,V2, V4, V11, V19)
# Training model
LR_Model <- glm(formula = Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data = subtraindown,family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
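##Because logistic regression works on the log-odds scale, exponentiating the fitted coefficients turns them into odds ratios, which makes the “odds of fraud” framing above concrete; a quick sketch:
exp(coef(LR_Model))  # e.g. a value near 2 would mean a one-unit increase roughly doubles the odds of fraud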
### Validate on trainData
Valid_trainData <- predict(LR_Model, newdata = subtraindown, type = "response") #prediction threshold
Valid_trainData <- ifelse(Valid_trainData > 0.5, 1, 0) # set binary
#produce confusion matrix
confusion_Mat<- confusionMatrix(as.factor(subtraindown$Class),as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 388 6
## 1 30 364
##
## Accuracy : 0.9543
## 95% CI : (0.9373, 0.9678)
## No Information Rate : 0.5305
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9086
##
## Mcnemar's Test P-Value : 0.0001264
##
## Sensitivity : 0.9282
## Specificity : 0.9838
## Pos Pred Value : 0.9848
## Neg Pred Value : 0.9239
## Prevalence : 0.5305
## Detection Rate : 0.4924
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9560
##
## 'Positive' Class : 0
##
##Overall, this is a very high-performing model. Looking at the confusion matrix, the room for improvement is the misclassification of some fraud cases as non-fraud.
### Validate on validData
testData_Class_predicted <- predict(LR_Model, newdata = test, type = "response")
testData_Class_predicted <- ifelse(testData_Class_predicted > 0.5, 1, 0) # set binary prediction threshold
conMat<- confusionMatrix(as.factor(test$Class),as.factor(testData_Class_predicted))
Regression_Acc_Test <-round(conMat$overall["Accuracy"]*100,2)
paste('Model Test Accuracy =', Regression_Acc_Test)
## [1] "Model Test Accuracy = 96.77"
# create roc curve
roc_object <- roc( test$Class, testData_Class_predicted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# calculate area under curve
auc(roc_object)
## Area under the curve: 0.9278
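##We can also plot the curve that the next paragraph describes; a quick sketch with pROC:
plot(roc_object, print.auc = TRUE, legacy.axes = TRUE)  # sensitivity vs. 1 - specificity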
##The ROC curve helps us visualize the trade-off between sensitivity (True Positive Rate) and specificity (1 - False Positive Rate) for various threshold values. A perfect classifier would have an ROC curve that passes through the top-left corner of the plot (100% sensitivity and 100% specificity). The area under the ROC curve (AUC) is a scalar value that summarizes the performance of the classifier. An AUC of 1.0 indicates a perfect classifier, while an AUC of 0.5 suggests that the classifier is no better than random chance. Ours is 93%.
##The other “good news” is that, because our performance on the train data and on the test data are so close, over-fitting is unlikely.
##SVM - a Support Vector Machine - is a machine learning algorithm. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
# fit the model using default parameters
SVM_model<- svm(Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19, data=subtraindown, kernel = 'radial', type="C-classification")
Valid_trainData <- predict(SVM_model, subtraindown)
#produce confusion matrix
confusion_Mat<- confusionMatrix(as.factor(subtraindown$Class), as.factor(Valid_trainData))
print(confusion_Mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 393 1
## 1 40 354
##
## Accuracy : 0.948
## 95% CI : (0.9301, 0.9624)
## No Information Rate : 0.5495
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8959
##
## Mcnemar's Test P-Value : 2.946e-09
##
## Sensitivity : 0.9076
## Specificity : 0.9972
## Pos Pred Value : 0.9975
## Neg Pred Value : 0.8985
## Prevalence : 0.5495
## Detection Rate : 0.4987
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : 0
##
# output accuracy
SVM_Acc_Train <- round(confusion_Mat$overall["Accuracy"]*100,4)
paste('Model Train Accuracy =', SVM_Acc_Train)
## [1] "Model Train Accuracy = 94.797"
##Again we have a highly accurate model with very few misclassifications, and the cases that are misclassified are fraud cases labeled as non-fraud.
### Test on Test Data
Test_Fraud_predicted <- predict(SVM_model, test) #produce confusion matrix
conMat<- confusionMatrix(as.factor(test$Class), as.factor(Test_Fraud_predicted))
# output accuracy
SVM_Acc_Test <- round(conMat$overall["Accuracy"]*100,4)
paste('Model Test Accuracy =', SVM_Acc_Test)
## [1] "Model Test Accuracy = 98.6148"
# prediction accuracy on test
SVM_Acc <- c(SVM_Acc_Train, SVM_Acc_Test)
SVM_Acc
## Accuracy Accuracy
## 94.7970 98.6148
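##The radial-kernel SVM above used e1071's default cost and gamma; they could be grid-searched with tune.svm(), a sketch only, not run here:
svm_tune <- tune.svm(Class ~ V17 + V14 + V12 + V10 + V2 + V4 + V11 + V19,
data = subtraindown,
gamma = c(0.01, 0.1, 1), cost = c(1, 10, 100))
summary(svm_tune)  # reports the best gamma/cost by cross-validated error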
embed_url("https://youtu.be/ido6NrjGi2o?si=u0rlXAEfER8ao6aT")