In this document we will predict fraudulent transactions using the ‘Credit Card Fraud Detection’ dataset available from Kaggle. After a short EDA, I will use the keras package to build an Artificial Neural Network (ANN) and compare its results to those of a traditional logistic regression.
The dataset covers transactions made by European credit card holders over two days in September 2013. It contains 284,807 transactions, 492 of which have been marked as fraudulent. Since only 0.17% of transactions are fraudulent, the dataset is highly imbalanced.
The features of this dataset are all numerical values resulting from a principal component analysis (PCA) transformation. Due to confidentiality, the original features are not available.
library(tidyverse)
library(moments)
library(keras)
library(yardstick)
library(caret)
ccdata <- read_csv('datasets/creditcarddata.csv')
glimpse(ccdata)
## Observations: 284,807
## Variables: 31
## $ Time <int> 0, 0, 1, 1, 2, 2, 4, 7, 7, 9, 10, 10, 10, 11, 12, 12, 1...
## $ V1 <dbl> -1.3598071, 1.1918571, -1.3583541, -0.9662717, -1.15823...
## $ V2 <dbl> -0.07278117, 0.26615071, -1.34016307, -0.18522601, 0.87...
## $ V3 <dbl> 2.53634674, 0.16648011, 1.77320934, 1.79299334, 1.54871...
## $ V4 <dbl> 1.37815522, 0.44815408, 0.37977959, -0.86329128, 0.4030...
## $ V5 <dbl> -0.33832077, 0.06001765, -0.50319813, -0.01030888, -0.4...
## $ V6 <dbl> 0.46238778, -0.08236081, 1.80049938, 1.24720317, 0.0959...
## $ V7 <dbl> 0.239598554, -0.078802983, 0.791460956, 0.237608940, 0....
## $ V8 <dbl> 0.098697901, 0.085101655, 0.247675787, 0.377435875, -0....
## $ V9 <dbl> 0.3637870, -0.2554251, -1.5146543, -1.3870241, 0.817739...
## $ V10 <dbl> 0.09079417, -0.16697441, 0.20764287, -0.05495192, 0.753...
## $ V11 <dbl> -0.55159953, 1.61272666, 0.62450146, -0.22648726, -0.82...
## $ V12 <dbl> -0.61780086, 1.06523531, 0.06608369, 0.17822823, 0.5381...
## $ V13 <dbl> -0.99138985, 0.48909502, 0.71729273, 0.50775687, 1.3458...
## $ V14 <dbl> -0.31116935, -0.14377230, -0.16594592, -0.28792375, -1....
## $ V15 <dbl> 1.468176972, 0.635558093, 2.345864949, -0.631418118, 0....
## $ V16 <dbl> -0.47040053, 0.46391704, -2.89008319, -1.05964725, -0.4...
## $ V17 <dbl> 0.207971242, -0.114804663, 1.109969379, -0.684092786, -...
## $ V18 <dbl> 0.02579058, -0.18336127, -0.12135931, 1.96577500, -0.03...
## $ V19 <dbl> 0.40399296, -0.14578304, -2.26185710, -1.23262197, 0.80...
## $ V20 <dbl> 0.25141210, -0.06908314, 0.52497973, -0.20803778, 0.408...
## $ V21 <dbl> -0.018306778, -0.225775248, 0.247998153, -0.108300452, ...
## $ V22 <dbl> 0.277837576, -0.638671953, 0.771679402, 0.005273597, 0....
## $ V23 <dbl> -0.110473910, 0.101288021, 0.909412262, -0.190320519, -...
## $ V24 <dbl> 0.06692807, -0.33984648, -0.68928096, -1.17557533, 0.14...
## $ V25 <dbl> 0.12853936, 0.16717040, -0.32764183, 0.64737603, -0.206...
## $ V26 <dbl> -0.18911484, 0.12589453, -0.13909657, -0.22192884, 0.50...
## $ V27 <dbl> 0.133558377, -0.008983099, -0.055352794, 0.062722849, 0...
## $ V28 <dbl> -0.021053053, 0.014724169, -0.059751841, 0.061457629, 0...
## $ Amount <dbl> 149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 4.99, 40.80,...
## $ Class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
We see a total of 31 variables.
Features V1 - V28 are the result of the PCA transformation and are simply numerical representations of the original features.
Time is the number of seconds elapsed between each transaction and the first transaction in the dataset.
Amount is the value of the transaction in dollars.
Class indicates whether the transaction has been tagged as fraudulent: 0 means the transaction is not fraudulent, 1 means it is. This will be our target variable.
summary(ccdata)
## Time V1 V2
## Min. : 0 Min. :-56.40751 Min. :-72.71573
## 1st Qu.: 54201 1st Qu.: -0.92037 1st Qu.: -0.59855
## Median : 84692 Median : 0.01811 Median : 0.06549
## Mean : 94814 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372
## Max. :172792 Max. : 2.45493 Max. : 22.05773
## NA's :1
## V3 V4 V5
## Min. :-48.3256 Min. :-5.68317 Min. :-113.74331
## 1st Qu.: -0.8904 1st Qu.:-0.84864 1st Qu.: -0.69160
## Median : 0.1799 Median :-0.01985 Median : -0.05434
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0272 3rd Qu.: 0.74334 3rd Qu.: 0.61193
## Max. : 9.3826 Max. :16.87534 Max. : 34.80167
##
## V6 V7 V8
## Min. :-26.1605 Min. :-43.5572 Min. :-73.21672
## 1st Qu.: -0.7683 1st Qu.: -0.5541 1st Qu.: -0.20863
## Median : -0.2742 Median : 0.0401 Median : 0.02236
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3986 3rd Qu.: 0.5704 3rd Qu.: 0.32735
## Max. : 73.3016 Max. :120.5895 Max. : 20.00721
##
## V9 V10 V11
## Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 15.59500 Max. : 23.74514 Max. :12.01891
##
## V12 V13 V14
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256
## Median : 0.1400 Median :-0.01357 Median : 0.0506
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268
##
## V15 V16 V17
## Min. :-4.49894 Min. :-14.12985 Min. :-25.16280
## 1st Qu.:-0.58288 1st Qu.: -0.46804 1st Qu.: -0.48375
## Median : 0.04807 Median : 0.06641 Median : -0.06568
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.64882 3rd Qu.: 0.52330 3rd Qu.: 0.39968
## Max. : 8.87774 Max. : 17.31511 Max. : 9.25353
##
## V18 V19 V20
## Min. :-9.498746 Min. :-7.213527 Min. :-54.49772
## 1st Qu.:-0.498850 1st Qu.:-0.456299 1st Qu.: -0.21172
## Median :-0.003636 Median : 0.003735 Median : -0.06248
## Mean : 0.000000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.500807 3rd Qu.: 0.458949 3rd Qu.: 0.13304
## Max. : 5.041069 Max. : 5.591971 Max. : 39.42090
##
## V21 V22 V23
## Min. :-34.83038 Min. :-10.933144 Min. :-44.80774
## 1st Qu.: -0.22839 1st Qu.: -0.542350 1st Qu.: -0.16185
## Median : -0.02945 Median : 0.006782 Median : -0.01119
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.18638 3rd Qu.: 0.528554 3rd Qu.: 0.14764
## Max. : 27.20284 Max. : 10.503090 Max. : 22.52841
##
## V24 V25 V26
## Min. :-2.83663 Min. :-10.29540 Min. :-2.60455
## 1st Qu.:-0.35459 1st Qu.: -0.31715 1st Qu.:-0.32698
## Median : 0.04098 Median : 0.01659 Median :-0.05214
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.43953 3rd Qu.: 0.35072 3rd Qu.: 0.24095
## Max. : 4.58455 Max. : 7.51959 Max. : 3.51735
##
## V27 V28 Amount
## Min. :-22.565679 Min. :-15.43008 Min. : 0.00
## 1st Qu.: -0.070840 1st Qu.: -0.05296 1st Qu.: 5.60
## Median : 0.001342 Median : 0.01124 Median : 22.00
## Mean : 0.000000 Mean : 0.00000 Mean : 88.35
## 3rd Qu.: 0.091045 3rd Qu.: 0.07828 3rd Qu.: 77.17
## Max. : 31.612198 Max. : 33.84781 Max. :25691.16
##
## Class
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.001728
## 3rd Qu.:0.000000
## Max. :1.000000
##
length(ccdata$Class)
## [1] 284807
summary(as.factor(ccdata$Class))
## 0 1
## 284315 492
492/284315
## [1] 0.001730475
Of the 284,807 transactions contained in this dataset, only 492 are tagged as fraudulent, which makes for a highly imbalanced dataset. Fraudulent activity accounts for only about 0.17% of the data, a ratio of roughly 578:1; by comparison, class ratios of around 4:1 are often still considered reasonably balanced.
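As a quick tidyverse-style check of the imbalance (a sketch; this simply restates the counts above as proportions):
# Class counts and proportions
ccdata %>%
  count(Class) %>%
  mutate(prop = n / sum(n))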
# Checking for missing data
sapply(ccdata, function(x) sum(is.na(x)))
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 0 0 0 0 0 0 0 0 0
## V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 0 0 0 0 0 0 0 0 0 0
## V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
## 0 0 0 0 0 0 0 0 0 0
## Class
## 0
We see one missing value in the Time column, so we take a look at that transaction.
ccdata[which(is.na(ccdata$Time)), ]
## # A tibble: 1 x 31
## Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA -1.41 -1.62 2.55 0.386 0.503 0.507 -1.74 0.761 3.00 -1.16 0.394
## # ... with 19 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>,
## # V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>, V22
## # <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>, V28
## # <dbl>, Amount <dbl>, Class <int>
Here we see that the transaction missing a Time value has not been marked as fraudulent and is for an amount of $49.50. All values in V1 - V28 fall within the ranges seen in the variable summaries above, so we will keep this transaction in our dataset.
Next, a quick visual analysis of our data before we move on to the prediction work.
# Visualize Amount feature
ggplot(ccdata, aes(x = Amount)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Calculate skewness using moments package
skewness(ccdata$Amount)
## [1] 16.97764
The Amount variable is heavily right-skewed, both visually and as indicated by a skewness of about 17 (a symmetric distribution has a skewness of 0). As such, we will be normalizing this feature.
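Before normalizing, one way to get a more readable picture of a distribution this skewed is to plot Amount on a log-like scale; a minimal sketch (log1p handles the many zero and near-zero amounts):
# Histogram of Amount on a log1p scale to tame the right skew
ggplot(ccdata, aes(x = Amount)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(trans = 'log1p')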
# Summary statistics of Amount feature by class
ccdata %>%
group_by(Class) %>%
summarise(min = min(Amount),
mean = mean(Amount),
max = max(Amount),
sd = sd(Amount))
## # A tibble: 2 x 5
## Class min mean max sd
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 88.3 25691 250
## 2 1 0 122 2126 257
ggplot(ccdata, aes(x = 1, y = Amount, fill = as.factor(Class), group = Class)) +
geom_boxplot()
This plot shows just how skewed the data are: in the non-fraudulent cases especially, the distribution is so skewed that a boxplot is nearly useless.
Before normalizing, let's take a quick look at just the fraudulent cases.
# Plot only fraudulent data
ccdata %>%
filter(Class == 1) %>%
ggplot(aes(x = 1, y = Amount)) +
geom_boxplot() +
geom_violin(fill = 'lightblue', color = 'lightblue', alpha = .7) +
scale_y_sqrt()
By using a square root scale we can better visualize where the data lie. The overlaid violin plot shows us that there is a secondary peak near the third quartile.
The Time feature denotes the number of seconds that passed between each transaction and the first transaction in the dataset. Since we do not have a start time for that first transaction, the Time column carries very little useful information, so we will remove the ‘Time’ feature from our data.
# Removing the TIME feature
ccdata <- ccdata[, -1]
colnames(ccdata)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [8] "V8" "V9" "V10" "V11" "V12" "V13" "V14"
## [15] "V15" "V16" "V17" "V18" "V19" "V20" "V21"
## [22] "V22" "V23" "V24" "V25" "V26" "V27" "V28"
## [29] "Amount" "Class"
# Convert Class to a factor
ccdata$Class <- as.factor(ccdata$Class)
# Create new dataframe on which we will create our models
ccproc <- ccdata
# Normalize the Amount feature
ccproc$Amount <- scale(ccproc$Amount, center = TRUE, scale = TRUE)
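scale() with these arguments is just a z-score transformation (subtract the mean, divide by the standard deviation); a quick sanity check we could run (a sketch, not output from the original run):
# Verify that scale() produced z-scores: mean ~ 0 and standard deviation ~ 1
manual <- (ccdata$Amount - mean(ccdata$Amount)) / sd(ccdata$Amount)
all.equal(as.numeric(ccproc$Amount), manual)   # should be TRUE
c(mean = mean(ccproc$Amount), sd = sd(ccproc$Amount))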
We will use an 80/20 train/test split.
#Splitting the data
set.seed(42)
train <- sample(1:nrow(ccproc), nrow(ccproc)* .8)
model_data <- ccproc[train, ]
test_data <- ccproc[-train, ]
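Since fraud is so rare, it is worth confirming that the random split left fraudulent cases in both halves; a quick check (again a sketch, not output from the original run):
# Class proportions in the training and test sets
prop.table(table(model_data$Class))
prop.table(table(test_data$Class))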
We do some preprocessing to get the data in a form compliant with keras.
# Predictor variables for the training and testing sets
X_train <- model_data[, -30]
X_test <- test_data[, -30]
# Target variable for the training and testing sets
Y_train <- model_data[, 30]
Y_test <- test_data[, 30]
To predict values in the test set, I will build two models:
Logistic Regression using the glm() function
Artificial Neural Network using keras
We will use the logistic regression model as a baseline against which to compare the performance of the keras ANN.
The first model is the logistic regression. Logistic regression models are popular because they are easy to fit and their output is easy to interpret.
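As a reminder of what glm() with a logit link fits: the log-odds of fraud are modeled as a linear function of the predictors, and the logistic (sigmoid) function maps that linear predictor back to a probability. A minimal illustration using base R's plogis():
# The logistic function maps any log-odds value to a probability in (0, 1)
eta <- seq(-6, 6, by = 0.1)   # hypothetical linear-predictor (log-odds) values
p   <- plogis(eta)            # equivalent to 1 / (1 + exp(-eta))
plot(eta, p, type = 'l', xlab = 'linear predictor (log-odds)', ylab = 'P(Class = 1)')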
# Fitting the model
glm_model <- glm(Class ~ . , family = binomial(link='logit'), data = model_data)
summary(glm_model)
##
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = model_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8107 -0.0297 -0.0195 -0.0125 4.5966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.6409790 0.1629605 -53.025 < 2e-16 ***
## V1 0.0816868 0.0459098 1.779 0.075193 .
## V2 0.0216503 0.0647928 0.334 0.738268
## V3 0.0265704 0.0503064 0.528 0.597381
## V4 0.6917694 0.0825821 8.377 < 2e-16 ***
## V5 0.1069355 0.0734670 1.456 0.145515
## V6 -0.1105463 0.0843440 -1.311 0.189973
## V7 -0.1064610 0.0734993 -1.448 0.147488
## V8 -0.1638044 0.0334983 -4.890 1.01e-06 ***
## V9 -0.2576267 0.1237902 -2.081 0.037420 *
## V10 -0.8069236 0.1055221 -7.647 2.06e-14 ***
## V11 -0.0814785 0.0845132 -0.964 0.335000
## V12 0.1196401 0.1005030 1.190 0.233884
## V13 -0.3360515 0.0926917 -3.625 0.000288 ***
## V14 -0.5853790 0.0705124 -8.302 < 2e-16 ***
## V15 -0.0361303 0.0945174 -0.382 0.702268
## V16 -0.2058326 0.1411167 -1.459 0.144676
## V17 -0.0249571 0.0772978 -0.323 0.746794
## V18 -0.0005058 0.1432247 -0.004 0.997182
## V19 0.1618482 0.1073241 1.508 0.131546
## V20 -0.4133680 0.0900533 -4.590 4.43e-06 ***
## V21 0.3624651 0.0650006 5.576 2.46e-08 ***
## V22 0.5979897 0.1435925 4.164 3.12e-05 ***
## V23 -0.1326601 0.0640901 -2.070 0.038462 *
## V24 0.0953750 0.1642361 0.581 0.561430
## V25 0.0532637 0.1424103 0.374 0.708393
## V26 -0.0262107 0.2117731 -0.124 0.901499
## V27 -0.7904580 0.1337405 -5.910 3.41e-09 ***
## V28 -0.2325387 0.0960162 -2.422 0.015441 *
## Amount 0.2369191 0.1046229 2.265 0.023543 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5875.3 on 227844 degrees of freedom
## Residual deviance: 1819.4 on 227815 degrees of freedom
## AIC: 1879.4
##
## Number of Fisher Scoring iterations: 12
# Use the anova() function to see feature statistics
anova(glm_model, test='Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 227844 5875.3
## V1 1 618.93 227843 5256.3 < 2.2e-16 ***
## V2 1 581.86 227842 4674.5 < 2.2e-16 ***
## V3 1 831.65 227841 3842.8 < 2.2e-16 ***
## V4 1 870.46 227840 2972.4 < 2.2e-16 ***
## V5 1 39.19 227839 2933.2 3.838e-10 ***
## V6 1 40.27 227838 2892.9 2.217e-10 ***
## V7 1 41.30 227837 2851.6 1.304e-10 ***
## V8 1 79.53 227836 2772.1 < 2.2e-16 ***
## V9 1 48.93 227835 2723.1 2.647e-12 ***
## V10 1 539.06 227834 2184.1 < 2.2e-16 ***
## V11 1 56.96 227833 2127.1 4.443e-14 ***
## V12 1 39.59 227832 2087.5 3.138e-10 ***
## V13 1 27.43 227831 2060.1 1.627e-07 ***
## V14 1 134.81 227830 1925.3 < 2.2e-16 ***
## V15 1 0.03 227829 1925.3 0.861331
## V16 1 42.38 227828 1882.9 7.505e-11 ***
## V17 1 2.20 227827 1880.7 0.138375
## V18 1 0.89 227826 1879.8 0.345281
## V19 1 0.47 227825 1879.3 0.494363
## V20 1 1.39 227824 1877.9 0.237828
## V21 1 8.87 227823 1869.1 0.002904 **
## V22 1 14.63 227822 1854.4 0.000131 ***
## V23 1 8.53 227821 1845.9 0.003487 **
## V24 1 0.13 227820 1845.8 0.716258
## V25 1 0.10 227819 1845.7 0.752440
## V26 1 0.07 227818 1845.6 0.788516
## V27 1 15.66 227817 1829.9 7.586e-05 ***
## V28 1 3.51 227816 1826.4 0.061096 .
## Amount 1 7.07 227815 1819.4 0.007860 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we use the logistic model to predict on the testing data.
fitted.results <- predict(glm_model, test_data[, 1:29], type = 'response')
fitted.results <- ifelse(fitted.results > .5, 1, 0)
# confusionMatrix() expects factors for both the predictions and the reference
conf_matrix_glm <- confusionMatrix(reference = unlist(Y_test), data = as.factor(fitted.results))
conf_matrix_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56862 33
## 1 8 59
##
## Accuracy : 0.9993
## 95% CI : (0.999, 0.9995)
## No Information Rate : 0.9984
## P-Value [Acc > NIR] : 1.886e-09
##
## Kappa : 0.7418
## Mcnemar's Test P-Value : 0.0001781
##
## Sensitivity : 0.9999
## Specificity : 0.6413
## Pos Pred Value : 0.9994
## Neg Pred Value : 0.8806
## Prevalence : 0.9984
## Detection Rate : 0.9982
## Detection Prevalence : 0.9988
## Balanced Accuracy : 0.8206
##
## 'Positive' Class : 0
##
Of the 92 fraudulent transactions in our test set, the GLM model correctly identified 59 occurrences, while producing 8 false positives and 33 false negatives.
This works out to an accuracy of 0.9993, a sensitivity of 0.9999, and a specificity of 0.6413 (with the non-fraudulent class treated as the positive class).
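Because accuracy is dominated by the majority class, it also helps to look at recall and precision for the fraud class specifically; computed directly from the confusion matrix counts above:
# Fraud-class recall and precision from the GLM confusion matrix
TP <- 59; FP <- 8; FN <- 33
TP / (TP + FN)   # recall for fraud, ~0.641 (the specificity reported by caret)
TP / (TP + FP)   # precision for fraud, ~0.881 (the negative predictive value reported by caret)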
Next we want to use keras to create an artificial neural network. Our network will have two hidden layers with 10% dropout.
First we initialize the sequential model.
# Building the ANN model
model_keras <- keras_model_sequential()
model2 <- model_keras %>%
layer_dense(units = 16, kernel_initializer = 'uniform', activation = 'relu', input_shape = ncol(X_train)) %>%
layer_dropout(rate = 0.1) %>%
layer_dense(units = 8, kernel_initializer = 'uniform' , activation = 'relu') %>%
layer_dropout(rate = 0.1) %>%
layer_dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid')
# Compiling
compile2 <- model_keras %>%
compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = c('accuracy'))
compile2
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 16) 480
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 16) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 8) 136
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 8) 0
## ___________________________________________________________________________
## dense_3 (Dense) (None, 1) 9
## ===========================================================================
## Total params: 625
## Trainable params: 625
## Non-trainable params: 0
## ___________________________________________________________________________
summary(model2)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 16) 480
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 16) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 8) 136
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 8) 0
## ___________________________________________________________________________
## dense_3 (Dense) (None, 1) 9
## ===========================================================================
## Total params: 625
## Trainable params: 625
## Non-trainable params: 0
## ___________________________________________________________________________
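The parameter counts in the summary follow directly from the layer sizes: each dense layer has (inputs x units) weights plus one bias per unit.
# Where the parameter counts come from (29 input features)
29 * 16 + 16   # dense_1: 480
16 * 8  + 8    # dense_2: 136
8  * 1  + 1    # dense_3: 9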
Now we fit the model and plot the training history.
# Fitting the model
kemodel <- fit(object = model2,
               x = as.matrix(X_train),
               # convert the factor target to a numeric 0/1 vector, as keras expects
               y = as.numeric(as.character(unlist(Y_train))),
               batch_size = 100,
               epochs = 30,
               validation_split = 0.30)
# Plotting the output
plot(kemodel) +
labs( title = 'Deep Learning Training Result')
After just a few epochs, the accuracy plateaus and the loss stops improving.
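Given how quickly the validation metrics flatten out, one option (not used in the original run) would be an early-stopping callback so training halts once the validation loss stops improving; a hedged sketch:
# Hypothetical refit with early stopping on the validation loss
kemodel_es <- fit(object = model2,
                  x = as.matrix(X_train),
                  y = as.numeric(as.character(unlist(Y_train))),
                  batch_size = 100,
                  epochs = 30,
                  validation_split = 0.30,
                  callbacks = list(callback_early_stopping(monitor = 'val_loss', patience = 3)))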
Now we use the ANN to predict classes for the test data, along with the predicted class probabilities.
# Predicting for the test data
pred_class <- predict_classes(object = model2,
x = as.matrix(X_test)) %>%
as.vector()
# Predicted class probability
pred_prob <- predict_proba(object = model2,
x = as.matrix(X_test)) %>%
as.vector()
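Note that predict_classes() and predict_proba() have since been removed from the keras R package; with a more recent version, an equivalent (assuming the single sigmoid output and the same 0.5 threshold) would be:
# Newer keras versions: predict() returns the sigmoid probability directly
pred_prob  <- predict(model2, as.matrix(X_test))[, 1]
pred_class <- as.integer(pred_prob > 0.5)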
Some post-processing to gather the results into one tibble:
# Format test data and prediction vector
predict_value <- tibble(
truth = as.factor(unlist(Y_test)) %>% fct_recode(Yes = '1', No = '0') ,
estimate = as.factor(pred_class) %>% fct_recode(Yes = '1', No = '0'),
pred_prob = pred_prob
)
print(predict_value)
## # A tibble: 56,962 x 3
## truth estimate pred_prob
## <fct> <fct> <dbl>
## 1 No No 0.000252
## 2 No No 0.000000000134
## 3 No No 0.000000124
## 4 No No 0.0000000683
## 5 No No 0.0000000556
## 6 No No 0.000000000000000517
## 7 No No 0.00000000954
## 8 No No 0.0000000000000114
## 9 No No 0.00000165
## 10 No No 0.000000331
## # ... with 56,952 more rows
Now, to compare with the logistic regression model from earlier, let's create a confusion matrix.
# Confusion Matrix
conf_matrix_ann <- confusionMatrix(reference = predict_value$truth, data = predict_value$estimate)
conf_matrix_ann
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 56857 21
## Yes 13 71
##
## Accuracy : 0.9994
## 95% CI : (0.9992, 0.9996)
## No Information Rate : 0.9984
## P-Value [Acc > NIR] : 3.363e-12
##
## Kappa : 0.8065
## Mcnemar's Test P-Value : 0.2299
##
## Sensitivity : 0.9998
## Specificity : 0.7717
## Pos Pred Value : 0.9996
## Neg Pred Value : 0.8452
## Prevalence : 0.9984
## Detection Rate : 0.9982
## Detection Prevalence : 0.9985
## Balanced Accuracy : 0.8858
##
## 'Positive' Class : No
##
By this confusion matrix, the ANN correctly identified 71 of the fraudulent transactions, with 13 false positives and 21 false negatives. This works out to a sensitivity of 0.9998 and a specificity of 0.7717.
Let's compare the two models back to back.
# ANN
conf_matrix_ann$table
## Reference
## Prediction No Yes
## No 56857 21
## Yes 13 71
conf_matrix_ann$byClass[1]
## Sensitivity
## 0.9997714
conf_matrix_ann$byClass[2]
## Specificity
## 0.7717391
predict_value %>% metrics(truth, estimate)
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.999
# GLM
conf_matrix_glm$table
## Reference
## Prediction 0 1
## 0 56862 33
## 1 8 59
conf_matrix_glm$byClass[1]
## Sensitivity
## 0.9998593
conf_matrix_glm$byClass[2]
## Specificity
## 0.6413043
conf_matrix_glm$overall[1]
## Accuracy
## 0.9992802
Due to the high number of negative (non-fraudulent) cases in the test set, both models have a very high sensitivity. It should be noted, though, that the GLM identified noticeably fewer of the fraudulent transactions (59 versus 71 for the ANN).
The Artificial Neural Network produced a substantially better specificity. This means the ANN was much better at picking up on fraudulent transactions, although it was slightly more likely to flag a legitimate transaction as fraud.
These two findings are reflected in the accuracy of both models. Once again, because of the high number of negatives in the test set, both values are very close to 1, though the accuracy of the ANN is slightly higher (by about 1.2 × 10^-4).
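The Balanced Accuracy reported by caret, the mean of sensitivity and specificity, is a more informative single number on data this imbalanced; using the values printed above:
# Balanced accuracy = (sensitivity + specificity) / 2
mean(c(0.9998, 0.7717))   # ANN, ~0.886
mean(c(0.9999, 0.6413))   # GLM, ~0.821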
This is a difficult dataset to model because of the extreme imbalance between positive and negative transactions. Of the two models built here, however, the better one is the Artificial Neural Network: even though it produces a few more false positives on non-fraudulent transactions, it is much better at picking up on true fraudulent activity.