The data was posted on Kaggle for credit card fraud detection. Anonymized credit card transactions are labeled as genuine or fraudulent. The data for the analysis is available here.
The dataset contains transactions made with credit cards by European cardholders in September 2013. These transactions are a subset of all online transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount (how much money was transferred in the transaction); this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.
library(tidyverse)
library(h2o)
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(ggthemes)
library(knitr)
library(caret)
library(caretEnsemble)
library(plotly)
library(lime)
library(plotROC)
library(pROC)
#h2o.init(nthreads=-1,enable_assertions = FALSE)
h2o.init(nthreads=-1)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 hours 6 minutes
H2O cluster version: 3.14.0.3
H2O cluster version age: 15 days
H2O cluster name: H2O_started_from_R_nanaakwasiabayieboateng_qlb223
H2O cluster total nodes: 1
H2O cluster total memory: 6.57 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.1 (2017-06-30)
localH2O = h2o.init(ip = 'localhost', port = 54321, nthreads = -1,max_mem_size = "8G")
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 hours 6 minutes
H2O cluster version: 3.14.0.3
H2O cluster version age: 15 days
H2O cluster name: H2O_started_from_R_nanaakwasiabayieboateng_qlb223
H2O cluster total nodes: 1
H2O cluster total memory: 6.57 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.1 (2017-06-30)
The dataset gives > 280,000 instances of credit card use, and for each transaction we know whether it was fraudulent or not.
Datasets like this need special treatment when performing machine learning because they are severely unbalanced: in this case, only 0.17% of all transactions are fraudulent.
While we could try to work with classifiers like random forests or support vector machines by applying over- or under-sampling techniques, we can alternatively try to find anomalies in the data (assuming we expect our fraud cases to be anomalies within the whole dataset).
When dealing with such a severe imbalance of response labels, we also need to be careful when measuring model performance. Because there are only a handful of fraudulent instances, a model that predicts everything as non-fraud will already achieve more than 99% accuracy. But despite its high accuracy, such a model won't necessarily help us find the fraudulent cases - the proverbial needle in a haystack - that we actually want to find!
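To make the accuracy paradox concrete, here is a small illustrative sketch; it uses simulated labels with the same class ratio rather than the actual dataset (which is only loaded further below).
# Toy illustration: labels with ~0.17% positives, and a "model" that always predicts non-fraud
set.seed(1)
labels   <- rbinom(284807, size = 1, prob = 0.00172)  # simulated response, ~0.17% frauds
baseline <- rep(0, length(labels))                    # predict "not fraud" for every instance
mean(baseline == labels)                              # accuracy > 99%, yet recall for frauds is 0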
Below, I will show how to use autoencoders for anomaly detection, how to use an autoencoder to pre-train a classification model, and how to measure model performance on unbalanced data.
Set up parallel processing. Parallel processing can reduce the computational time for the high volume of computation involved in deep neural networks.
# Calculate the number of cores
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20")
# download from https://www.kaggle.com/dalpozz/creditcardfraud
#creditcard <- read_csv("creditcard.csv")
creditcard <- import("creditcard.csv")
creditcard%>%head()
str(creditcard)
'data.frame': 284807 obs. of 31 variables:
$ Time : num 0 0 1 1 2 2 4 7 7 9 ...
$ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
$ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
$ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
$ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
$ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
$ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
$ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
$ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
$ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
$ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
$ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
$ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
$ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
$ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
$ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
$ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
$ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
$ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
$ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
$ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
$ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
$ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
$ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
$ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
$ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
$ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
$ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
$ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
$ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
$ Class : chr "0" "0" "0" "0" ...
psych::describe(creditcard)%>%as_tibble()
table(creditcard$Class)
0 1
284315 492
#Hmisc::describe(creditcard)
#==================================================================
# Histograms
#==================================================================
theme_set(theme_economist_white())
ggplot(creditcard, aes(x ="",y=V1, fill=Class))+ geom_boxplot()+labs(x="V1",y="")
ggplot(creditcard,aes(x = Amount)) +
geom_histogram(color = "#D53E4F", fill = "#D53E4F", bins = 50) +
facet_wrap( ~ Class, scales = "free", ncol = 2)
ggplot(creditcard, aes(x =Time,fill = Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)
ggplot(creditcard, aes(x =V2, fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)
ggplot(creditcard, aes(x =V3, fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)
ggplot(creditcard, aes(x =V4,fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)
ggplot(creditcard, aes(x=V6, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()
ggplot(creditcard, aes(x=V7, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()
ggplot(creditcard, aes(x=V8, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()
#ggplot(Data, aes(x="",y =loan, fill=y)) + geom_histogram()
ggplot(creditcard, aes(x ="",y=V10, fill=Class))+ geom_violin(adjust = .5,draw_quantiles = c(0.25, 0.5, 0.75))+labs(x="V10",y="")
Most of the principal components are centered around zero.
creditcard %>%
ggplot(aes(x = Class)) +
geom_bar(color = "chocolate", fill = "chocolate", width = 0.2) +
theme_bw()
To convert the time in seconds to days, note that there are 86,400 seconds in a day (60 seconds x 60 minutes x 24 hours). We create two new variables: whether a credit card transaction occurred on the first or the second day, and the time of day of the transaction. We can use a case_when() statement or, equivalently, an ifelse() statement for this step.
creditcard=creditcard %>%
mutate(Day = case_when(.$Time > 3600 * 24 ~ "day2",.$Time < 3600 * 24 ~ "day1"))
creditcard%>%head()
#equivalently
#creditcard$day <- if_else(creditcard$Time > 3600 * 24, "day2", "day1")
creditcard=creditcard %>%
mutate(Time_day = case_when(.$Day == "day2"~ .$Time - 86400 ,.$Day == "day1"~ .$Time))
#equivalently
# make transaction relative to day
#creditcard$Time_day <- if_else(creditcard$Day == "day2", creditcard$Time - 86400, creditcard$Time)
table(creditcard[,"Day"])
day1 day2
144786 140020
#creditcard%>%group_by(Day)%>%dplyr::summarise_at("Time_day",summary)
tapply(creditcard$Time_day,creditcard$Day,summary,simplify = FALSE)
$day1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 38432 54689 52948 70976 86398
$day2
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 37843 53425 51705 68182 86392
head(creditcard)
#alternatively
creditcard %>%
group_by(Day) %>%
summarise_at("Time_day",c("min", "max","mean","median"))%>%drop_na()
creditcard<-creditcard%>%mutate_if(is.character,as.factor)
#Alternatively
creditcard=creditcard %>%
mutate(Time_Group = case_when(.$Time_day <= 38138~ "g1" ,
.$Time_day <= 52327~ "g2",
.$Time_day <= 69580~"g3",
.$Time_day > 69580~"g4"))
head(creditcard)
# bin transactions according to time of day
# creditcard$Time <- as.factor(ifelse(creditcard$Time_day <= 38138, "gr1", # mean 1st Qu.
# ifelse(creditcard$Time_day <= 52327, "gr2", # mean mean
# ifelse(creditcard$Time_day <= 69580, "gr3", # mean 3rd Qu
# "gr4"))))
#
# creditcard$Time%>%head()
creditcard %>%drop_na()%>%
ggplot(aes(x = Day)) +
geom_bar(fill = "chocolate",width = 0.3,color="chocolate") +
theme_economist_white()
The number of transactions in the two days is roughly the same; both are a little under 150,000.
# convert class variable to factor
creditcard$Class <- factor(creditcard$Class)
#c(" Day","Time_day")
#creditcard <- select(creditcard,-Day)
str(creditcard)
'data.frame': 284807 obs. of 34 variables:
$ Time : num 0 0 1 1 2 2 4 7 7 9 ...
$ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
$ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
$ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
$ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
$ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
$ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
$ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
$ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
$ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
$ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
$ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
$ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
$ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
$ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
$ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
$ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
$ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
$ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
$ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
$ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
$ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
$ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
$ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
$ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
$ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
$ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
$ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
$ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
$ Amount : num 149.62 2.69 378.66 123.5 69.99 ...
$ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Day : Factor w/ 2 levels "day1","day2": 1 1 1 1 1 1 1 1 1 1 ...
$ Time_day : num 0 0 1 1 2 2 4 7 7 9 ...
$ Time_Group: chr "g1" "g1" "g1" "g1" ...
creditcard %>%drop_na()%>%
ggplot(aes(x = Time_Group)) +
geom_bar(color = "#238B45", fill = "#238B45") +
theme_bw() +
facet_wrap( ~ Class, scales = "free", ncol = 2)
The distribution of transactions over the four time bins shows that the majority of fraud cases happened in group 1, whereas the distribution of genuine transactions remained fairly even over the four groups.
tapply(creditcard$Amount ,creditcard$Class,summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 5.65 22.00 88.29 77.05 25691.16
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 9.25 122.21 105.89 2125.87
The fraudulent credit card transactions had a higher mean amount of money transferred, but the maximum amount was much lower compared to genuine transactions. The genuine transactions also had a higher median.
We use the h2o infrastructure to train our deep learning models and also to perform the anomaly detection.
# convert data to H2OFrame
creditcard_h2o <- as.h2o(creditcard)
We split the data into training, validation, and test sets at 60%, 20%, and 20% respectively. The validation and test sets will be used for model validation and prediction.
splits <- h2o.splitFrame(creditcard_h2o,
ratios = c(0.6, 0.2),
seed = 148) #partition data into 60%, 20%, 20% chunks
train <- splits[[1]]
validation <- splits[[2]]
test <- splits[[3]]
outcome_name <- "Class"
features <- setdiff(colnames(train), outcome_name)
An autoencoder is an artificial neural network used for unsupervised learning of efficient codings. The goal of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. Here, the model has to squeeze the input data through a bottleneck of 10 nodes (the middle hidden layer). The autoencoder will learn the patterns in the credit card transactions in order to identify anomalies and similar transactions. Autoencoding reduces the feature space in order to distill the essential aspects of the data, unlike most conventional deep learning, which blows up the feature space to capture non-linearities and subtle interactions within the data. Autoencoding can also be seen as a non-linear alternative to PCA.
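For comparison with the linear case, the short sketch below (an illustration added here, not part of the modelling pipeline) computes a PCA-based reconstruction error on the numeric columns; the autoencoder plays the same role, but with non-linear encodings.
# Illustrative linear analogue: reconstruct the numeric features from k principal
# components and measure the per-row reconstruction error.
num_feats <- creditcard %>% dplyr::select(V1:V28, Amount) %>% scale()
pca       <- prcomp(num_feats, center = FALSE, scale. = FALSE)  # data already scaled
k         <- 10
recon     <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])
pca_mse   <- rowMeans((num_feats - recon)^2)   # large values = poorly reconstructed rows
summary(pca_mse)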
The tanh function is a rescaled and shifted logistic function. Its symmetry around 0 speeds up convergence of the training algorithm. Another activation function available in the h2o architecture is the rectified linear (ReLU) activation function. It has demonstrated high performance on image recognition tasks in practice, and is a more biologically plausible model of neuron activations. The third activation function available is Maxout, a generalization of the Rectifier activation in which each neuron picks the larger output of k separate channels, each with its own weights and bias values.
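A quick numerical check of the relationship between tanh and the logistic function (plogis() is the logistic function in base R):
# tanh is the logistic curve rescaled to (-1, 1) and stretched along x:
# tanh(x) = 2 * logistic(2x) - 1
x <- seq(-3, 3, by = 0.5)
all.equal(tanh(x), 2 * plogis(2 * x) - 1)  # TRUE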
We build our first model below. We train a deep learning model with autoencoder set to TRUE; note that y should not be specified when autoencoder is TRUE.
model_one = h2o.deeplearning(x = features, training_frame = train,
autoencoder = TRUE,
reproducible = TRUE,
seed = 148,
hidden = c(10,10,10), epochs = 100,activation = "Tanh",
validation_frame = test)
#model_id = " model_one")
We save the model to file so we can reload it later and avoid re-running the training each time.
h2o.saveModel(model_one, path="model_one", force = TRUE)
[1] "/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_one/DeepLearning_model_R_1507439493390_3791"
model_one <- h2o.loadModel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_one/DeepLearning_model_R_1507439493390_3791")
model_one
Model Details:
==============
H2OAutoEncoderModel: deeplearning
Model ID: DeepLearning_model_R_1507439493390_3791
Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 944 weights/biases, 20.1 KB, 2,739,472 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms
1 1 34 Input 0.00 %
2 2 10 Tanh 0.00 % 0.000000 0.000000 0.610547 0.305915 0.000000 -0.000347 0.309377
3 3 10 Tanh 0.00 % 0.000000 0.000000 0.181705 0.103598 0.000000 0.022774 0.262611
4 4 10 Tanh 0.00 % 0.000000 0.000000 0.133090 0.079663 0.000000 0.000808 0.337259
5 5 34 Tanh 0.000000 0.000000 0.116252 0.129859 0.000000 0.006941 0.357547
mean_bias bias_rms
1
2 -0.028166 0.148318
3 -0.056455 0.099918
4 0.032588 0.101952
5 0.167973 0.688510
H2OAutoEncoderMetrics: deeplearning
** Reported on training data. **
Training Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.0003654009
RMSE: (Extract with `h2o.rmse`) 0.01911546
H2OAutoEncoderMetrics: deeplearning
** Reported on validation data. **
Validation Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.0003508435
RMSE: (Extract with `h2o.rmse`) 0.01873082
There are a number of utility functions that allow us to inspect the model.
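For example, a couple of other inspection helpers (shown here as a sketch, output omitted):
# reconstruction MSE on the validation frame
h2o.mse(model_one, valid = TRUE)
# full autoencoder metrics computed on new data
h2o.performance(model_one, newdata = test)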
h2o.scoreHistory(model_one)%>%head()
Scoring History:
timestamp duration training_speed epochs iterations samples training_rmse
1 2017-10-08 03:45:10 1.664 sec 0.00000 obs/sec 0.00000 0 0.000000 0.22811
2 2017-10-08 03:45:16 6.986 sec 33149 obs/sec 1.00000 1 171217.000000 0.02151
3 2017-10-08 03:45:21 12.335 sec 33165 obs/sec 2.00000 2 342434.000000 0.01944
4 2017-10-08 03:45:26 17.270 sec 33976 obs/sec 3.00000 3 513651.000000 0.01971
5 2017-10-08 03:45:31 22.126 sec 34550 obs/sec 4.00000 4 684868.000000 0.01952
6 2017-10-08 03:45:36 27.030 sec 34817 obs/sec 5.00000 5 856085.000000 0.01973
training_mse validation_rmse validation_mse
1 0.05203 0.22758 0.05179
2 0.00046 0.02113 0.00045
3 0.00038 0.01909 0.00036
4 0.00039 0.01934 0.00037
5 0.00038 0.01919 0.00037
6 0.00039 0.01937 0.00038
We can use the h2o.predict function on the test set. For an autoencoder model this returns the reconstructed (autoencoded) representation of the input rather than class labels.
#Convert to autoencoded representation
test_autoencoder <- h2o.predict(model_one, test)
We can extract these hidden features with the h2o.deepfeatures() function. Here we take layer 2, which has 10 nodes, so the reduced data has 10 columns.
train_features <- h2o.deepfeatures(model_one, train, layer = 2) %>%
as.data.frame() %>%
mutate(Class = as.vector(train[, 31]))
train_features%>%head()
ggplot(train_features, aes(x = DF.L2.C1, y = DF.L2.C2, color = Class)) +
geom_point(alpha = 0.1, size = 1.5) + theme_bw() +
scale_color_brewer(palette = "Accent")
ggplot(train_features, aes(x = DF.L2.C3, y = DF.L2.C4, color = Class)) +
geom_point(alpha = 0.1, size = 1.5) + theme_bw() +
scale_color_brewer(palette = "Accent")
Some of the fraudulent transactions are picked out by this dimensionality-reduction approach with our autoencoder model; the inner hidden layer of 10 nodes does separate out fraudulent transactions. We can also train a new model on features extracted from another hidden layer of the first model. Taking the third layer results in 10 columns, since that layer also has 10 nodes.
# let's take the third hidden layer
train_features <- h2o.deepfeatures(model_one, validation, layer = 3) %>%
as.data.frame() %>%
mutate(Class = as.factor(as.vector(validation[, 31]))) %>%
as.h2o()
train_features%>%head()
features_two <- setdiff(colnames(train_features), outcome_name)
features_two
[1] "DF.L3.C1" "DF.L3.C2" "DF.L3.C3" "DF.L3.C4" "DF.L3.C5" "DF.L3.C6" "DF.L3.C7"
[8] "DF.L3.C8" "DF.L3.C9" "DF.L3.C10"
model_two <- h2o.deeplearning(y = outcome_name,
x = features_two,
training_frame = train_features,
reproducible = TRUE,
balance_classes = TRUE,
ignore_const_cols = FALSE,
seed = 148,
hidden = c(10, 5, 10),
epochs = 100,
activation = "Tanh")
h2o.saveModel(model_two, path="model_two", force = TRUE)
[1] "/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_two/DeepLearning_model_R_1507439493390_3867"
model_two <- h2o.loadModel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_two/DeepLearning_model_R_1507439493390_3792")
model_two
Model Details:
==============
H2OBinomialModel: deeplearning
Model ID: DeepLearning_model_R_1507439493390_3792
Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 247 weights/biases, 8.0 KB, 2,383,962 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms
1 1 10 Input 0.00 %
2 2 10 Tanh 0.00 % 0.000000 0.000000 0.001515 0.001883 0.000000 -0.149216 0.768610
3 3 5 Tanh 0.00 % 0.000000 0.000000 0.003293 0.004916 0.000000 -0.251950 0.885017
4 4 10 Tanh 0.00 % 0.000000 0.000000 0.002252 0.001780 0.000000 0.073398 1.217405
5 5 2 Softmax 0.000000 0.000000 0.007459 0.007915 0.000000 -0.095975 3.579932
mean_bias bias_rms
1
2 -0.038682 0.891455
3 -0.307971 0.531144
4 -0.354956 0.887678
5 0.223286 1.172508
H2OBinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9892 samples **
MSE: 0.1129424
RMSE: 0.336069
LogLoss: 0.336795
Mean Per-Class Error: 0.006234916
AUC: 0.9983689
Gini: 0.9967379
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4910 62 0.012470 =62/4972
1 0 4920 0.000000 =0/4920
Totals 4910 4982 0.006268 =62/9892
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.009908 0.993739 153
2 max f2 0.009908 0.997486 153
3 max f0point5 0.019214 0.990107 142
4 max accuracy 0.009908 0.993732 153
5 max precision 1.000000 1.000000 0
6 max recall 0.009908 1.000000 153
7 max specificity 1.000000 1.000000 0
8 max absolute_mcc 0.009908 0.987543 153
9 max min_per_class_accuracy 0.019214 0.989541 142
10 max mean_per_class_accuracy 0.009908 0.993765 153
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
For measuring model performance on the test data, we need to convert the test data to the same reduced dimensions as the training data:
test_3 <- h2o.deepfeatures(model_one, test, layer = 3)
test_3%>%head()
#train_features <- h2o.deepfeatures(model_one, test, layer = 3)
test_pred=h2o.predict(model_two, test_3,type="response")%>%
as.data.frame() %>%
mutate(actual = as.vector(test[, 31]))
test_pred%>%head()
test_pred%>%tail()
h2o.predict(model_two, test_3) %>%
as.data.frame() %>%
dplyr::mutate(actual = as.vector(test[, 31])) %>%
group_by(actual, predict) %>%
dplyr::summarise(n = n()) %>%
mutate(freq = n / sum(n))
The second model correctly predicts the genuine transactions at about 81% but does not do a good job at predicting the fraudulent ones, at just under 64%.
We can also ask which instances were considered outliers or anomalies within our test data, using the h2o.anomaly() function. Based on the autoencoder model that was trained before, the input data is reconstructed and, for each instance, the mean squared error (MSE) between the actual value and the reconstruction is calculated.
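Conceptually, the anomaly score is the per-row mean squared difference between the input and its reconstruction. The rough sketch below recomputes this idea by hand on the V1-V28 columns only, reusing the test_autoencoder reconstructions from above; it will not match h2o.anomaly() exactly, because H2O standardizes inputs and one-hot encodes factors internally before computing the reconstruction error.
# Hand-rolled, approximate reconstruction error (illustration only)
recon_test <- as.data.frame(test_autoencoder)   # reconstructions from h2o.predict() above
orig_test  <- as.data.frame(test)
num_cols   <- paste0("V", 1:28)
manual_mse <- rowMeans((as.matrix(orig_test[, num_cols]) -
                        as.matrix(recon_test[, paste0("reconstr_", num_cols)]))^2)
head(manual_mse)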
We compute the mean MSE for both class labels. (Setting per_feature = TRUE in h2o.anomaly() would instead return per-feature reconstruction error scores.)
anomaly <- h2o.anomaly(model_one, test) %>%
as.data.frame() %>%
tibble::rownames_to_column() %>%
mutate(Class = as.vector(test[, 31]))
mean_mse <- anomaly %>%
group_by(Class) %>%
summarise(mean = mean(Reconstruction.MSE))
anomaly<-anomaly%>%mutate_if(is.character,as.factor)
anomaly$rowname=as.numeric(anomaly$rowname)
anomaly%>%head()
mean_mse
We reconstruct the original data set using the reduced set of features from model_one and calculate the mean squared error between the two. We set the per_feature parameter to FALSE in the h2o.anomaly call because we want a reconstruction error per observation, not per individual feature.
creditcard.anon = h2o.anomaly(model_one, train, per_feature=FALSE)
MSE<-creditcard.anon%>%as_tibble()
MSE$Index<-1:length(MSE$Reconstruction.MSE)
ggplot(MSE,aes(x=Index,y=sort(Reconstruction.MSE)))+geom_point()+ylab("Reconstruction.MSE")+theme_economist_white()
anomaly%>%head()
anomaly%>%ggplot( aes(x = rowname, y = Reconstruction.MSE,color=Class)) +
geom_point(alpha = 0.3) +
#geom_hline(data = mean_mse, aes(yintercept = mean)) +
geom_hline(yintercept =0.02,color="#3288BD") +
#scale_color_brewer(palette="Reds")
scale_color_manual(breaks = c("0", "1"),values=c("#99D594" ,"#D53E4F"))+
labs(x = "instance number", color = "Class")
We can flag any instance whose reconstruction MSE lies above a chosen threshold (here 0.02) as an anomaly. There is no perfect separation between genuine and fraudulent transactions.
anomaly <- anomaly %>%
mutate(outlier = ifelse(Reconstruction.MSE > 0.02 , "outlier", "no_outlier"))
anomaly %>%
group_by(Class, outlier) %>%
dplyr:: summarise(n = n()) %>%
mutate(freq = n / sum(n))
We can train a supervised model using the autoencoder model trained as model_one.
model_three <- h2o.deeplearning(y = outcome_name,
x = features,
training_frame = train,
reproducible = TRUE,
balance_classes = TRUE,
# pretrained_autoencoder = "model_one",
seed = 148,
hidden = c(10, 2, 10),
epochs = 100,
activation = "Tanh")
h2o.saveModel(model_three, path="model_three", force = TRUE)
[1] "/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_three/DeepLearning_model_R_1507439493390_3890"
model_three <- h2o.loadModel("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/H20/model_three/DeepLearning_model_R_1507439493390_3815")
model_three
Model Details:
==============
H2OBinomialModel: deeplearning
Model ID: DeepLearning_model_R_1507439493390_3815
Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 424 weights/biases, 12.1 KB, 16,750,748 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms
1 1 34 Input 0.00 %
2 2 10 Tanh 0.00 % 0.000000 0.000000 0.502669 0.462040 0.000000 0.071233 1.666783
3 3 2 Tanh 0.00 % 0.000000 0.000000 0.001296 0.001975 0.000000 0.018401 0.692802
4 4 10 Tanh 0.00 % 0.000000 0.000000 0.001411 0.000409 0.000000 0.352422 1.897482
5 5 2 Softmax 0.000000 0.000000 0.003503 0.000459 0.000000 5.364876 14.779671
mean_bias bias_rms
1
2 0.928092 1.454369
3 -0.425624 1.263055
4 -0.690195 1.860847
5 20.088384 10.134045
H2OBinomialMetrics: deeplearning
** Reported on training data. **
** Metrics reported on temporary training frame with 9843 samples **
MSE: 0.000706574
RMSE: 0.02658146
LogLoss: 0.007980691
Mean Per-Class Error: 0.0005104124
AUC: 0.9999345
Gini: 0.999869
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4893 5 0.001021 =5/4898
1 0 4945 0.000000 =0/4945
Totals 4893 4950 0.000508 =5/9843
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.873948 0.999495 134
2 max f2 0.873948 0.999798 134
3 max f0point5 0.873948 0.999192 134
4 max accuracy 0.873948 0.999492 134
5 max precision 0.992492 1.000000 0
6 max recall 0.873948 1.000000 134
7 max specificity 0.992492 1.000000 0
8 max absolute_mcc 0.873948 0.998985 134
9 max min_per_class_accuracy 0.873948 0.998979 134
10 max mean_per_class_accuracy 0.873948 0.999490 134
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
pred <- as.data.frame(h2o.predict(object = model_three, newdata = test)) %>%
mutate(actual = as.vector(test[, 31]))
pred %>%
group_by(actual, predict) %>%
dplyr::summarise(n = n()) %>%
mutate(freq = n / sum(n))
pred %>%
ggplot(aes(x = actual, fill = predict)) +
geom_bar() +
theme_bw() +
scale_fill_brewer(palette = "Accent") +
facet_wrap( ~ actual, scales = "free", ncol = 2)
Among the genuine transactions, model_three predicts correctly with over 99% accuracy, whereas it predicts fraudulent transactions with only about 81% accuracy.
The very high frequency of genuine transactions creates a bias against the fraudulent transactions; we have a case of unbalanced classes. We cannot use performance measures like accuracy or area under the curve (AUC), as they would give overly optimistic results based on the high percentage of correct classifications of the majority class.
Sensitivity (the true positive rate, recall, or probability of detection) measures the proportion of positives that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity (the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition).
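As a concrete check, both quantities can be computed directly from the pred data frame created above (a small sketch using the default 0.5 threshold):
# Sensitivity and specificity of model_three at the 0.5 threshold
pred_class  <- ifelse(pred$p1 > 0.5, 1, 0)
sensitivity <- sum(pred_class == 1 & pred$actual == 1) / sum(pred$actual == 1)  # TP / (TP + FN)
specificity <- sum(pred_class == 0 & pred$actual == 0) / sum(pred$actual == 0)  # TN / (TN + FP)
c(sensitivity = sensitivity, specificity = specificity)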
library(ROCR)
# http://stackoverflow.com/questions/24563061/computing-integral-of-a-line-plot-in-r
line_integral <- function(x, y) {
dx <- diff(x)
end <- length(y)
my <- (y[1:(end - 1)] + y[2:end]) / 2
sum(dx * my)
}
prediction_obj <- prediction(pred$p1, pred$actual)
par(mfrow = c(1, 2))
par(mar = c(5.1,4.1,4.1,2.1))
# precision-recall curve
perf1 <- performance(prediction_obj, measure = "prec", x.measure = "rec")
x <- perf1@x.values[[1]]
y <- perf1@y.values[[1]]
y[1] <- 0
plot(perf1, main = paste("Area Under the\nPrecision-Recall Curve:\n", round(abs(line_integral(x,y)), digits = 3)))
# sensitivity-specificity curve
perf2 <- performance(prediction_obj, measure = "sens", x.measure = "spec")
x <- perf2@x.values[[1]]
y <- perf2@y.values[[1]]
y[1] <- 0
plot(perf2, main = paste("Area Under the\nSensitivity-Specificity Curve:\n", round(abs(line_integral(x,y)), digits = 3)))
thresholds <- seq(from = 0, to = 1, by = 0.1)
pred_thresholds <- data.frame(actual = pred$actual)
for (threshold in thresholds) {
prediction <- ifelse(pred$p1 > threshold, 1, 0)
prediction_true <- ifelse(pred_thresholds$actual == prediction, TRUE, FALSE)
pred_thresholds <- cbind(pred_thresholds, prediction_true)
}
colnames(pred_thresholds)[-1] <- thresholds
pred_thresholds %>%
gather(x, y, 2:ncol(pred_thresholds)) %>%
group_by(actual, x, y) %>%
dplyr::summarise(n = n()) %>%
ggplot(aes(x = as.numeric(x), y = n, color = actual)) +
geom_vline(xintercept = 0.6, alpha = 0.5) +
geom_line() +
geom_point(alpha = 0.5) +
theme_bw() +
facet_wrap(actual ~ y, scales = "free", ncol = 2) +
labs(x = "prediction threshold",
y = "number of instances")
This plot tells us that we can increase the number of correctly classified non-fraud cases without losing correctly classified fraud cases when we increase the prediction threshold from the default 0.5 to 0.6:
pred %>%
mutate(predict = ifelse(pred$p1 > 0.6, 1, 0)) %>%
group_by(actual, predict) %>%
dplyr::summarise(n = n()) %>%
mutate(freq = n / sum(n))
The final model now correctly identified 82% of fraud cases and almost 100% of non-fraud cases.
predictor=h2o.predict(model_one,validation)
predictor
reconstr_Day.day1 reconstr_Day.day2 reconstr_Day.missing(NA) reconstr_Time reconstr_V1
1 0.9999810 0.0088891828 0.0016046761 4890.516 -0.46847048
2 0.9999791 -0.0046075421 0.0004958399 3838.492 -0.02448993
3 0.9999803 -0.0039506194 0.0008929561 3907.199 0.05040905
4 0.9999781 -0.0010562932 0.0010720840 4066.562 0.12708346
5 0.9999780 0.0002975057 0.0009419790 4012.287 0.18692148
6 0.9999781 0.0002093632 0.0011922675 4119.922 0.15399039
reconstr_V2 reconstr_V3 reconstr_V4 reconstr_V5 reconstr_V6 reconstr_V7 reconstr_V8 reconstr_V9
1 -0.21291205 1.2353052 -0.8359950 -0.7212934 0.4526312 -0.11879785 0.11944333 0.1019651
2 -0.05983887 0.8763147 0.5106733 -0.3639185 0.4835525 -0.23434344 -0.18214336 0.2516807
3 -0.12720027 0.8439560 -0.3667428 -0.4276805 0.3837319 -0.12716381 -0.07519365 0.2891557
4 -0.13691489 0.7002918 0.0484221 -0.3522574 0.5146559 -0.16214477 -0.27497546 0.3650589
5 -0.08397237 0.7472537 1.4957562 -0.2114755 0.4977608 -0.12482598 -0.17237711 0.3645234
6 -0.13109489 0.6761142 1.2720411 -0.2706485 0.4801763 -0.01487091 -0.07680689 0.3579488
reconstr_V10 reconstr_V11 reconstr_V12 reconstr_V13 reconstr_V14 reconstr_V15 reconstr_V16
1 -0.07570575 -0.20951263 -1.0766269 0.54286748 0.3896126 -0.685506905 -0.3047278
2 0.05194440 -0.72347388 -0.9768606 1.35954830 0.5132597 0.151156589 -0.1813253
3 0.01606018 -0.52817269 -0.9298266 1.73650489 0.5313205 0.614844145 -0.1774252
4 0.03161346 -0.08923635 -1.0947860 -0.03076353 0.5686297 0.911664473 -0.2068776
5 0.14570003 0.07460730 -0.8218927 0.67852585 0.5959739 -0.000708625 -0.1211473
6 0.10065008 1.18084728 -0.8312289 -0.51194174 0.6184612 -0.223709360 -0.1383939
reconstr_V17 reconstr_V18 reconstr_V19 reconstr_V20 reconstr_V21 reconstr_V22 reconstr_V23
1 -0.04447316 1.8043887 -1.2428462 -0.28093179 -0.01231993 -0.2636693 -0.0501585914
2 0.19810131 -0.1522527 0.7202632 -0.03191134 0.07982234 -0.1810774 -0.0607544492
3 0.17182023 -0.4827150 0.2859410 -0.12927952 0.04460432 -0.1591398 -0.0350346277
4 0.17417276 0.2454507 0.6248470 -0.07568934 0.09271094 -0.1900116 -0.0170971326
5 0.32476217 -1.0208377 0.4393752 -0.05467679 0.11748975 -0.1695218 -0.0023311029
6 0.26654033 -0.1076239 -0.3657769 -0.16086502 0.09864791 -0.1644872 -0.0001578228
reconstr_V24 reconstr_V25 reconstr_V26 reconstr_V27 reconstr_V28 reconstr_Amount
1 -1.08302826 0.5561263 -0.20619109 0.03303782 0.08590619 124.36830
2 0.18289728 0.4795835 0.49089385 0.03681535 0.05087932 86.52969
3 -0.60926371 0.4451906 -0.04808316 0.02431382 0.05704648 74.84240
4 0.97749277 0.4529765 -0.44711975 0.05251211 0.06897125 85.69969
5 0.63978441 0.4208858 -0.25137048 0.03722372 0.04729799 89.81052
6 0.03864071 0.3982487 -0.36770638 0.03159265 0.05781315 109.57149
reconstr_Time_day
1 5101.705
2 3877.441
3 4080.574
4 4114.671
5 3892.737
6 4041.221
[56863 rows x 34 columns]
test_pred=test_pred%>%mutate_if(is.factor,as.numeric)
auc_rf = pROC::roc(response=test_pred[,"actual"],
predictor=test_pred[,"predict"])
plot(auc_rf, print.thres = "best", main=paste('AUC:',round(auc_rf$auc[[1]],3)))
abline(h=1,col="#3288BD")
abline(h=0,col="#D53E4F")
test_pred=test_pred%>%mutate_if(is.character,as.numeric)
#str(test_pred)
r <-pROC::roc(test_pred$predict,test_pred$actual)
#td <- broom::tidy(r)
# ggplot(td, aes(fpr, tpr)) +
# geom_line()+labs(title="The Area Under the ROC Curve",y="TPR/Sensitivity",x="FPR/1-Specitivity")+theme_bw()+geom_abline(slope=1,intercept = 0,color="red")
#
# r <- roc(churn$predictions,churn$labels)
#
# td <- tidy(r)
#
#
#
# library(ggplot2)
# ggplot(td, aes(fpr, tpr)) +
# geom_line()+labs(title="Area under the ROC Curve",y="TPR/Sensitivity",x="FPR/1-Specitivity")+theme_bw()+geom_abline(slope=1,intercept = 0,color="red")
#
#rs <- smooth(r, method="density")
#plot(r, add=TRUE, col="green")
#pROC::plot.roc(r)
pROC::ggroc(auc_rf, alpha = 0.5, colour = "#FC8D59", linetype = 1, size = 2)+geom_abline(slope=1,intercept = 1,color="black")+theme_bw()+labs(title="Area under the ROC Curve",y="TPR/Sensitivity",x="FPR/1-Specificity")+ ggplot2::annotate("text",x=0.5,y=0.6, label = paste('AUC:',round(auc_rf$auc[[1]],3)))
#h2o.shutdown()
stopImplicitCluster()