We are using the Synthetic Financial Dataset for Fraud Detection, a simulator-generated dataset of mobile money transactions created for fraud detection research. The dataset has just over one million observations (1,048,575) with 11 features.
Details of the features:
Step - maps a unit of time in the real world; here, 1 step is 1 hour.
Type - CASH-IN, CASH-OUT, DEBIT, PAYMENT or TRANSFER.
Amt - amount of the transaction in local currency.
NameOrig - customer who initiated the transaction.
OldBalanceOrig - sender's balance before the transaction.
NewBalanceOrig - sender's balance after the transaction.
NameDest - customer who is the recipient of the transaction.
OldBalanceDest - recipient's balance before the transaction.
NewBalanceDest - recipient's balance after the transaction.
isFraud - marks transactions made by fraudulent agents inside the simulation (1 = fraud, 0 = legitimate).
isFlaggedFraud - the business model aims to control massive transfers between accounts and flags illegal attempts; in this dataset, an illegal attempt is a transfer of more than 200,000 in a single transaction.
#libraries used
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(caTools)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
library(rpart.plot)
library(Matrix)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(magrittr)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 3.6.2
library(Ckmeans.1d.dp)
#Import data into r environment
rawdata <- read.csv('C:/Myfolder/SyntheticData/SynData.csv')
#first 10 observations
head(rawdata, 10)
## Step Type Amt NameOrig OldBalanceOrig NewBalanceOrig NameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181.0 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181.0 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860.0 46042.29 M573487274
## 7 1 PAYMENT 7107.77 C154988899 183195.0 176087.23 M408069119
## 8 1 PAYMENT 7861.64 C1912850431 176087.2 168225.59 M633326333
## 9 1 PAYMENT 4024.36 C1265012928 2671.0 0.00 M1176932104
## 10 1 DEBIT 5337.77 C712410124 41720.0 36382.23 C195600860
## OldBalanceDest NewBalanceDest isFraud isFlaggedFraud
## 1 0 0.00 0 0
## 2 0 0.00 0 0
## 3 0 0.00 1 0
## 4 21182 0.00 1 0
## 5 0 0.00 0 0
## 6 0 0.00 0 0
## 7 0 0.00 0 0
## 8 0 0.00 0 0
## 9 0 0.00 0 0
## 10 41898 40348.79 0 0
#View structure of data
str(rawdata)
## 'data.frame': 1048575 obs. of 11 variables:
## $ Step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ Amt : num 9840 1864 181 181 11668 ...
## $ NameOrig : Factor w/ 1048317 levels "C1000001725",..: 125165 361320 165237 961663 567916 994158 298038 494238 143414 891803 ...
## $ OldBalanceOrig: num 170136 21249 181 181 41554 ...
## $ NewBalanceOrig: num 160296 19385 0 0 29886 ...
## $ NameDest : Factor w/ 449635 levels "C1000015936",..: 274918 286717 73551 65465 138002 371072 340846 381943 128187 47318 ...
## $ OldBalanceDest: num 0 0 0 21182 0 ...
## $ NewBalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
table(rawdata$Step)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 2708 1014 552 565 665 1660 6837 21097 37628 35991 37241 36153 37515
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 41485 44609 42471 43361 49579 51352 40625 19152 12635 6144 3216 1598 440
## 27 28 29 30 31 32 33 34 35 36 37 38 39
## 41 4 4 8 12 12 23616 30904 29157 39774 34000 31453 23391
## 40 41 42 43 44 45 46 47 48 49 50 51 52
## 34270 36348 41304 45060 38523 18500 12445 8681 5693 764 6 14 8
## 53 54 55 56 57 58 59 60 61 62 63 64 65
## 10 4 12 18 6 20 20 12 14 8 8 18 20
## 66 67 68 69 70 71 72 73 74 75 76 77 78
## 24 6 18 20 6 20 14 8 22 8 4 10 10
## 79 80 81 82 83 84 85 86 87 88 89 90 91
## 10 10 10 10 12 16 14 18 6 8 6 16 8
## 92 93 94 95
## 10 4444 10372 2980
#summary of data
summary(rawdata)
## Step Type Amt NameOrig
## Min. : 1.00 CASH_IN :227130 Min. : 0 C1026010130: 2
## 1st Qu.:15.00 CASH_OUT:373641 1st Qu.: 12149 C1028315484: 2
## Median :20.00 DEBIT : 7178 Median : 76343 C1039623203: 2
## Mean :26.97 PAYMENT :353873 Mean : 158667 C1041466144: 2
## 3rd Qu.:39.00 TRANSFER: 86753 3rd Qu.: 213762 C1052471021: 2
## Max. :95.00 Max. :10000000 C1056260641: 2
## (Other) :1048563
## OldBalanceOrig NewBalanceOrig NameDest OldBalanceDest
## Min. : 0 Min. : 0 C985934102 : 98 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 C1286084959: 96 1st Qu.: 0
## Median : 16002 Median : 0 C1590550415: 89 Median : 126377
## Mean : 874010 Mean : 893809 C248609774 : 88 Mean : 978160
## 3rd Qu.: 136642 3rd Qu.: 174600 C665576141 : 87 3rd Qu.: 915923
## Max. :38900000 Max. :38900000 C2083562754: 86 Max. :42100000
## (Other) :1048031
## NewBalanceDest isFraud isFlaggedFraud
## Min. : 0 Min. :0.000000 Min. :0
## 1st Qu.: 0 1st Qu.:0.000000 1st Qu.:0
## Median : 218260 Median :0.000000 Median :0
## Mean : 1114198 Mean :0.001089 Mean :0
## 3rd Qu.: 1149808 3rd Qu.:0.000000 3rd Qu.:0
## Max. :42200000 Max. :1.000000 Max. :0
##
Exploratory Data Analysis
#Finding missing values
colSums(is.na(rawdata))
## Step Type Amt NameOrig OldBalanceOrig
## 0 0 0 0 0
## NewBalanceOrig NameDest OldBalanceDest NewBalanceDest isFraud
## 0 0 0 0 0
## isFlaggedFraud
## 0
There are no missing values in the data.
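Interestingly, the summary() output above shows isFlaggedFraud is 0 for every record, even though the flagging rule described earlier should catch transfers above 200,000. A quick check of this (a sketch; output not shown):
#how many TRANSFERs exceed the 200,000 flagging threshold,
#and how many of those were actually flagged?
rawdata %>%
filter(Type == 'TRANSFER', Amt > 200000) %>%
count(isFlaggedFraud)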
#No. of fraud records
rawdata %>% count(isFraud)
## # A tibble: 2 x 2
## isFraud n
## <int> <int>
## 1 0 1047433
## 2 1 1142
#percent of fraudulent records
prop.table(table(rawdata$isFraud))*100
##
## 0 1
## 99.8910903 0.1089097
There are 1,142 fraud records, which is approximately 0.11% of the dataset.
#barplot of fraud vs non-fraud
barplot(prop.table(table(rawdata$isFraud)),
names.arg = c('not fraud', 'fraud'),
ylab = 'Proportion of transactions', main = "Fraud vs Not Fraud", col = 'light pink', ylim = c(0, 1.0))
#no.of transactions per Type
rawdata %>% count(Type) %>% arrange(desc(n))
## # A tibble: 5 x 2
## Type n
## <fct> <int>
## 1 CASH_OUT 373641
## 2 PAYMENT 353873
## 3 CASH_IN 227130
## 4 TRANSFER 86753
## 5 DEBIT 7178
rawdata %>% ggplot(aes(x=Type, fill = Type)) +
geom_bar() +
labs(title = "Transactions as per Type", x = 'Transaction Type' , y = 'No of transactions' ) +
theme_classic()
The graph shows that CASH_OUT and PAYMENT account for the largest numbers of transactions in the full dataset.
#Finding the transaction types in which fraud occurs
rawdata %>% group_by(Type) %>% summarise(fraud_transactions = sum(isFraud))
## # A tibble: 5 x 2
## Type fraud_transactions
## <fct> <int>
## 1 CASH_IN 0
## 2 CASH_OUT 578
## 3 DEBIT 0
## 4 PAYMENT 0
## 5 TRANSFER 564
fraud_trans <- rawdata %>% group_by(Type) %>% summarise(fraud_transactions = sum(isFraud))
ggplot(fraud_trans, aes(x = Type, y = fraud_transactions)) +
geom_col(aes(fill = Type), show.legend = FALSE) +
labs(title = 'Fraud transactions as per Type', x = 'Transaction type', y = 'No of Fraud Transactions') +
geom_text(aes(label = fraud_transactions), size = 4, hjust =.5) +
theme_classic()
The plot confirms that all the fraud transactions are of the CASH_OUT and TRANSFER types.
#frequency distribution of amount in fraudulent transactions
ggplot(rawdata[rawdata$isFraud==1, ], aes(x = Amt)) +
geom_histogram(bins = 40, fill = 'salmon') +
labs(title = 'Fraud transaction Amount distribution', y = 'No. of fraud transactions', x = 'Amount in Dollars')
The distribution is positively skewed: most fraud transactions involve relatively small amounts.
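Because of this skew, a log-scaled axis makes the shape of the distribution easier to read. A minimal sketch (any zero amounts would be dropped by the log scale):
#re-plotting fraud amounts on a log10 axis
ggplot(rawdata[rawdata$isFraud == 1, ], aes(x = Amt)) +
geom_histogram(bins = 40, fill = 'steelblue') +
scale_x_log10() +
labs(title = 'Fraud transaction amounts (log scale)', x = 'Amount (log10)', y = 'No. of fraud transactions')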
#Boxplot
ggplot(rawdata, aes(x = factor(isFraud) ,y = log(Amt), fill = factor(isFraud))) +
geom_boxplot(show.legend = FALSE) +
labs(title= 'Amount- Boxplot' , x = 'isFraud') +
theme_classic()
The boxplot shows that the amounts involved in fraud transactions are typically larger than those in legitimate transactions.
#Amount distribution by transaction type in legitimate transactions
ggplot(rawdata[rawdata$isFraud==0,], aes(x = Type, y = log(Amt), fill = Type)) +
geom_boxplot(show.legend = FALSE) +
labs(title= 'Amount- Boxplot' , x = 'Types') +
theme_classic()
The boxplot shows that, among legitimate transactions, TRANSFER involves the largest amounts, followed by CASH_OUT and CASH_IN, with DEBIT the smallest.
#comparing old origin balance with old destination balance
origBal <- ggplot(rawdata, aes(x = factor(isFraud), y = log1p(OldBalanceOrig), fill = factor(isFraud))) +
geom_boxplot(show.legend = FALSE) +
labs(title= 'Old balance in Origin Accounts', x = 'isFraud') +
theme_classic()
destBal <- ggplot(rawdata, aes(x = factor(isFraud), y = log1p(OldBalanceDest), fill = factor(isFraud))) +
geom_boxplot(show.legend = FALSE) +
labs(title= 'Old balance in Destination Accounts' , x = 'isFraud') +
theme_classic()
grid.arrange(origBal, destBal, nrow = 1)
In the majority of fraud transactions, the old balance of the origin account (from which the payment is made) is higher than in other origin accounts, while the old balance of the destination account is lower than in other destination accounts.
#Distribution of transactions at different time intervals
timeInt <- ggplot(rawdata, aes(x = Step)) +
geom_histogram(bins = 150, aes(fill = 'isFraud'), show.legend = FALSE) +
labs(title='Total transactions at different time interval', y='No. of transactions') +
theme_classic()
fradtimInt <- ggplot(rawdata[rawdata$isFraud==1,], aes(x = Step)) +
geom_histogram(bins =150, aes(fill = 'isFraud'), show.legend = FALSE) +
theme_classic() +
labs(title= 'Fraud transactions at different time interval', y='No.of fraud transactions')
grid.arrange(timeInt, fradtimInt, nrow=2, ncol=1)
After step 50 the total number of transactions drops sharply, but fraud transactions do not seem to decline with it.
To examine the time dimension further, we convert the Step feature into a 24-hour format.
summary(rawdata$Step)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 20.00 26.97 39.00 95.00
#Creating a new Hour variable using Step
rawdata$Hour <- rawdata$Step %% 24  # %% is the base R modulo operator (mod() is not in base R)
table(rawdata$Hour)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 8923 5078 1482 615 581 689 1682 6871 21137 61260 66925 66430 75955
## 13 14 15 16 17 18 19 20 21 22 23
## 71543 72964 68014 76767 79735 90923 96426 79176 42116 35458 17825
Now we can see the number of transactions that took place at each hour of the day.
#comparing transactions on an hourly basis between all records and fraud records
timeIntHr <- ggplot(rawdata, aes(x = Hour)) +
geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
labs(title='Total transactions at different Hours', y='No. of transactions') +
theme_classic()
fradtimIntHr <- ggplot(rawdata[rawdata$isFraud==1,], aes(x = Hour)) +
geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
labs(title= 'Fraud transactions at different Hours', y = 'No. of fraud transactions') +
theme_classic()
grid.arrange(timeIntHr, fradtimIntHr, ncol = 1)
Looking at the graphs, very few transactions occur between hours 0 and 9 in the full dataset, whereas in the fraud records transactions take place around the clock.
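One way to quantify this is the hourly fraud rate: since fraud counts are roughly flat across the clock while total volume is not, the rate should spike in the quiet early-morning window. A sketch (output not shown):
#fraud rate per hour of day
rawdata %>%
group_by(Hour) %>%
summarise(n = n(), fraud_rate = mean(isFraud)) %>%
arrange(desc(fraud_rate))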
#to know whether transaction amount is greater than the available balance
head(rawdata[(rawdata$Amt > rawdata$OldBalanceOrig) &
(rawdata$NewBalanceDest > rawdata$OldBalanceDest),
c("Amt","OldBalanceOrig","NewBalanceOrig","OldBalanceDest", "NewBalanceDest", "isFraud")], 10)
## Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest
## 11 9644.94 4465.00 0 10845 157982.12
## 16 229133.94 15325.00 0 5083 51513.44
## 25 311685.89 10835.00 0 6267 2719172.89
## 49 5346.89 0.00 0 652637 6453430.91
## 73 94253.33 25203.05 0 99773 965870.05
## 82 78766.03 0.00 0 103772 277515.05
## 84 125872.53 0.00 0 348512 3420103.09
## 85 379856.23 0.00 0 900180 19200000.00
## 86 1505626.01 0.00 0 29031 5515763.34
## 89 761507.39 0.00 0 1280036 19200000.00
## isFraud
## 11 0
## 16 0
## 25 0
## 49 0
## 73 0
## 82 0
## 84 0
## 85 0
## 86 0
## 89 0
The above 10 records show cases where the transaction amount was greater than the balance available in the origin account, yet the destination balance still increased.
To capture these inconsistencies, let's create two new features, 'AdjBalOrig' and 'AdjBalDest'.
#Creating new features
#AdjBalOrig = NewBalanceOrig + Amt - OldBalanceOrig
#AdjBalDest = OldBalanceDest + Amt - NewBalanceDest
rawdata$AdjBalOrig <- round(rawdata$NewBalanceOrig + rawdata$Amt - rawdata$OldBalanceOrig, 2)
rawdata$AdjBalDest <- round(rawdata$OldBalanceDest + rawdata$Amt - rawdata$NewBalanceDest, 2)
head(rawdata)
## Step Type Amt NameOrig OldBalanceOrig NewBalanceOrig NameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## OldBalanceDest NewBalanceDest isFraud isFlaggedFraud Hour AdjBalOrig
## 1 0 0 0 0 1 0
## 2 0 0 0 0 1 0
## 3 0 0 1 0 1 0
## 4 21182 0 1 0 1 0
## 5 0 0 0 0 1 0
## 6 0 0 0 0 1 0
## AdjBalDest
## 1 9839.64
## 2 1864.28
## 3 181.00
## 4 21363.00
## 5 11668.14
## 6 7817.71
As we saw above, fraud occurs only in the CASH_OUT and TRANSFER types, so we extract only those two types of records for modelling.
#creating new dataset with filtering and subsetting
rawdata.filt <- rawdata %>% select(-c('Step','NameOrig', 'NameDest', 'isFlaggedFraud')) %>%
filter(Type %in% c('CASH_OUT', 'TRANSFER'))
table(rawdata.filt$Type)
##
## CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
## 0 373641 0 0 86753
We have extracted only the records with CASH_OUT and TRANSFER transaction types.
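The zero counts for CASH_IN, DEBIT and PAYMENT appear because subsetting keeps unused factor levels. A droplevels() call would remove them; a sketch, not applied here, which is why dummy_cols() below still creates columns for the empty levels:
#drop the now-empty factor levels so only CASH_OUT and TRANSFER remain
rawdata.filt$Type <- droplevels(rawdata.filt$Type)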
#one-hot encoding the Type variable
rawdata.filt.dum <- dummy_cols(rawdata.filt)
str(rawdata.filt.dum)
## 'data.frame': 460394 obs. of 15 variables:
## $ Type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 5 2 2 5 5 2 2 2 2 5 ...
## $ Amt : num 181 181 229134 215310 311686 ...
## $ OldBalanceOrig: num 181 181 15325 705 10835 ...
## $ NewBalanceOrig: num 0 0 0 0 0 ...
## $ OldBalanceDest: num 0 21182 5083 22425 6267 ...
## $ NewBalanceDest: num 0 0 51513 0 2719173 ...
## $ isFraud : int 1 1 0 0 0 0 0 0 0 0 ...
## $ Hour : num 1 1 1 1 1 1 1 1 1 1 ...
## $ AdjBalOrig : num 0 0 213809 214605 300851 ...
## $ AdjBalDest : num 181 21363 182704 237735 -2401220 ...
## $ Type_CASH_IN : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Type_CASH_OUT : int 0 1 1 0 0 1 1 1 1 0 ...
## $ Type_DEBIT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Type_PAYMENT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Type_TRANSFER : int 1 0 0 1 1 0 0 0 0 1 ...
We have created dummy variables for the Type variable.
#Dropping unnecessary features for further analysis
rawdata.filt.dum <- rawdata.filt.dum %>%
select(-c('Type','Type_CASH_IN', 'Type_DEBIT', 'Type_PAYMENT'))
str(rawdata.filt.dum)
## 'data.frame': 460394 obs. of 11 variables:
## $ Amt : num 181 181 229134 215310 311686 ...
## $ OldBalanceOrig: num 181 181 15325 705 10835 ...
## $ NewBalanceOrig: num 0 0 0 0 0 ...
## $ OldBalanceDest: num 0 21182 5083 22425 6267 ...
## $ NewBalanceDest: num 0 0 51513 0 2719173 ...
## $ isFraud : int 1 1 0 0 0 0 0 0 0 0 ...
## $ Hour : num 1 1 1 1 1 1 1 1 1 1 ...
## $ AdjBalOrig : num 0 0 213809 214605 300851 ...
## $ AdjBalDest : num 181 21363 182704 237735 -2401220 ...
## $ Type_CASH_OUT : int 0 1 1 0 0 1 1 1 1 0 ...
## $ Type_TRANSFER : int 1 0 0 1 1 0 0 0 0 1 ...
#convert isFraud into a factor for model development
rawdata.filt.dum$isFraud <- as.factor(rawdata.filt.dum$isFraud)
str(rawdata.filt.dum)
## 'data.frame': 460394 obs. of 11 variables:
## $ Amt : num 181 181 229134 215310 311686 ...
## $ OldBalanceOrig: num 181 181 15325 705 10835 ...
## $ NewBalanceOrig: num 0 0 0 0 0 ...
## $ OldBalanceDest: num 0 21182 5083 22425 6267 ...
## $ NewBalanceDest: num 0 0 51513 0 2719173 ...
## $ isFraud : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ Hour : num 1 1 1 1 1 1 1 1 1 1 ...
## $ AdjBalOrig : num 0 0 213809 214605 300851 ...
## $ AdjBalDest : num 181 21363 182704 237735 -2401220 ...
## $ Type_CASH_OUT : int 0 1 1 0 0 1 1 1 1 0 ...
## $ Type_TRANSFER : int 1 0 0 1 1 0 0 0 0 1 ...
For a classification model, the dependent variable must be a factor.
head(rawdata.filt.dum)
## Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest isFraud
## 1 181.0 181.00 0 0 0.00 1
## 2 181.0 181.00 0 21182 0.00 1
## 3 229133.9 15325.00 0 5083 51513.44 0
## 4 215310.3 705.00 0 22425 0.00 0
## 5 311685.9 10835.00 0 6267 2719172.89 0
## 6 110414.7 26845.41 0 288800 2415.16 0
## Hour AdjBalOrig AdjBalDest Type_CASH_OUT Type_TRANSFER
## 1 1 0.0 181.0 0 1
## 2 1 0.0 21363.0 1 0
## 3 1 213808.9 182703.5 1 0
## 4 1 214605.3 237735.3 0 1
## 5 1 300850.9 -2401220.0 0 1
## 6 1 83569.3 396799.5 1 0
This is the final data after all modifications.
Developing a prediction model
#splitting the data into train and test
set.seed(123)
ind <- sample(2, nrow(rawdata.filt.dum), replace = T, prob = c(.7,.3))
train <- rawdata.filt.dum[ind==1,]
test <- rawdata.filt.dum[ind==2,]
dim(train)
## [1] 322075 11
dim(test)
## [1] 138319 11
We use 70% of the filtered dataset to train the model and the remaining 30% to validate it.
The train dataset has 322,075 records, which will be used to build the prediction model; the test dataset has 138,319 records, which will be used to evaluate it.
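Because fraud is so rare, a stratified split that preserves the class ratio in both sets is a safer alternative to random assignment. A sketch using caret, which is already loaded (not the split used here):
#stratified 70/30 split: createDataPartition samples within each class
set.seed(123)
idx <- caret::createDataPartition(rawdata.filt.dum$isFraud, p = 0.7, list = FALSE)
train_strat <- rawdata.filt.dum[idx, ]
test_strat <- rawdata.filt.dum[-idx, ]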
#View 10 records of train data
head(train, 10)
## Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest
## 1 181.00 181.00 0.00 0.00 0.00
## 3 229133.94 15325.00 0.00 5083.00 51513.44
## 6 110414.71 26845.41 0.00 288800.00 2415.16
## 7 56953.90 1942.02 0.00 70253.00 64106.18
## 9 23261.30 20411.53 0.00 25742.00 0.00
## 10 62610.80 79114.00 16503.20 517.00 8383.29
## 12 47458.86 209534.84 162075.98 52120.00 0.00
## 13 136872.92 162075.98 25203.05 217806.00 0.00
## 14 94253.33 25203.05 0.00 99773.00 965870.05
## 15 42712.39 10363.39 0.00 57901.66 24044.18
## isFraud Hour AdjBalOrig AdjBalDest Type_CASH_OUT Type_TRANSFER
## 1 1 1 0.00 181.00 0 1
## 3 0 1 213808.94 182703.50 1 0
## 6 0 1 83569.30 396799.55 1 0
## 7 0 1 55011.88 63100.72 1 0
## 9 0 1 2849.77 49003.30 1 0
## 10 0 1 0.00 54744.51 0 1
## 12 0 1 0.00 99578.86 1 0
## 13 0 1 -0.01 354678.92 1 0
## 14 0 1 69050.28 -771843.72 1 0
## 15 0 1 32349.00 76569.87 0 1
Here are the first 10 records of the train dataset, showing what the data looks like after all modifications.
#Developing a model on train data with Random Forest algorithm
library(randomForest)
fit.forest <- randomForest(isFraud ~ ., data = train, ntree=50)
randomForest() grows 500 trees by default; to save computation time, we use 50 trees here.
#Print the model
print(fit.forest)
##
## Call:
## randomForest(formula = isFraud ~ ., data = train, ntree = 50)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.01%
## Confusion matrix:
## 0 1 class.error
## 0 321260 0 0.00000000
## 1 22 793 0.02699387
Here we can see that the model used 50 trees, tried 3 variables at each split, and has an OOB error rate of 0.01%, i.e. about 99.99% accuracy on out-of-bag samples. The OOB confusion matrix, however, shows that 22 of the 815 fraud cases were misclassified (a class error of about 2.7%).
#prediction on training data
p1 <- predict(fit.forest, train)
#Confusion Matrix on train data
library(caret)
confusionMatrix(train$isFraud, p1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 321260 0
## 1 3 812
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.9975
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9982
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9963
## Prevalence : 0.9975
## Detection Rate : 0.9975
## Detection Prevalence : 0.9975
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
The confusion matrix shows that 321,260 transactions actually belonging to the legitimate class were predicted as legitimate, and 812 transactions actually belonging to the fraudulent class were predicted as fraud; both are correct classifications.
However, 3 records that actually belonged to the fraudulent class were predicted as legitimate, which is a misclassification error. (Note that the actual labels were passed as the first argument to confusionMatrix(), so the rows labelled 'Prediction' in the output correspond to the true classes.)
So on the training data the model catches all but 3 of the fraud records, but these are records the model has already seen.
The confusion matrix on the training dataset shows almost 100% accuracy, but this needs to be confirmed on unseen data.
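Note that caret's convention is confusionMatrix(predictions, actuals), and the 'Positive' class defaults to the first factor level ('0'). To report sensitivity and precision for the fraud class directly, one could instead call (a sketch; the resulting table is the transpose of the one above):
#predictions first, actuals second; positive = '1' makes the fraud class
#the one for which sensitivity and precision are computed
confusionMatrix(data = p1, reference = train$isFraud, positive = '1')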
#prediction on test data
p2 <- predict(fit.forest, test)
#confusion matrix on test data
confusionMatrix(test$isFraud, p2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 137991 1
## 1 6 321
##
## Accuracy : 0.9999
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9977
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9892
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 1.0000
## Specificity : 0.9969
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9817
## Prevalence : 0.9977
## Detection Rate : 0.9976
## Detection Prevalence : 0.9976
## Balanced Accuracy : 0.9984
##
## 'Positive' Class : 0
##
This confusion matrix is based on the test data, which the model had not seen.
Based on the results, 137,991 records belonging to the legitimate class were predicted as legitimate, and 321 records belonging to the fraudulent class were predicted as fraud; both are correct classifications.
However, 6 fraud records were predicted as legitimate and 1 legitimate record was predicted as fraud, both misclassification errors (as before, the rows labelled 'Prediction' are the actual classes).
In all, the model achieves about 99.9% accuracy on the test data, which is excellent, though such performance is rarely seen on real-world data.
#Tuning parameters
#optimal no. of trees
plot(fit.forest)
The plot shows that the error rate drops as the number of trees increases and stabilises at around 35 trees.
Since the model above was built with 50 trees, about 35 trees would be sufficient.
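If runtime mattered, the model could simply be refit with the smaller tree count. A sketch (results should be nearly identical to the 50-tree model):
#refit with the stabilised tree count suggested by the error plot
set.seed(123)
fit.forest35 <- randomForest(isFraud ~ ., data = train, ntree = 35)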
#optimal mtry value
t <- tuneRF(train[,-6], train[,6],
stepFactor = .5,
plot = TRUE,
trace = TRUE,
ntreeTry = 35,
improve = 0.05)
## mtry = 3 OOB error = 0.01%
## Searching left ...
## mtry = 6 OOB error = 0.01%
## 0.1538462 0.05
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## invalid mtry: reset to within valid range
## mtry = 12 OOB error = 0.01%
## -0.5909091 0.05
## Searching right ...
## mtry = 1 OOB error = 0.19%
## -26.31818 0.05
This tuning run is for mtry; the plot suggests mtry = 3 is sufficient for near-optimal accuracy.
The random forest above already used mtry = 3, the default value (the square root of the number of features).
We therefore do not retune this parameter, as the model already achieves 99.9% accuracy on the test data.
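For reference, the default mtry for classification is the floor of the square root of the number of predictors:
#with 10 predictors, the default mtry for classification is:
floor(sqrt(10)) # 3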
hist(treesize(fit.forest),
main = "No. of Nodes for the Trees",
col = "skyblue")
#feature importance
fit.forest$importance
## MeanDecreaseGini
## Amt 145.43907
## OldBalanceOrig 101.43167
## NewBalanceOrig 717.41427
## OldBalanceDest 92.48587
## NewBalanceDest 97.82512
## Hour 41.02312
## AdjBalOrig 184.37075
## AdjBalDest 80.23465
## Type_CASH_OUT 82.91423
## Type_TRANSFER 83.10585
varImpPlot(fit.forest)
importance_matrix <- data.frame(Variables = rownames(fit.forest$importance), fit.forest$importance, row.names = NULL)
ggplot(importance_matrix, aes(y = MeanDecreaseGini , x = Variables, fill = Variables))+
geom_col() + coord_flip() +
labs(title= 'Variable importance plot')+
theme_classic()
In this model, the feature NewBalanceOrig ('new balance after the transaction') plays the most significant role in predicting fraudulent records, with the highest mean decrease in Gini at about 717.41, followed by AdjBalOrig, Amt and so on.
With the Random Forest algorithm, we achieved approximately 99.9% accuracy on the test data.
Next, we build another model on the same data using the Extreme Gradient Boosting algorithm.
#Extreme Gradient Boosting algorithm
library(xgboost)
library(Matrix)
#splitting the modified dataset into train and test
set.seed(123)
ind <- sample(2, nrow(rawdata.filt.dum), replace = T, prob = c(.7,.3))
train <- rawdata.filt.dum[ind==1,]
test <- rawdata.filt.dum[ind==2,]
dim(train)
## [1] 322075 11
dim(test)
## [1] 138319 11
#converting the independent features of train and test into xgb.DMatrix format
x_train <- as.matrix(train[,-6])
y_train <- as.numeric(as.character(train$isFraud))
x_test <- as.matrix(test[,-6])
y_test <- as.numeric(as.character(test$isFraud))
trainDM <- xgb.DMatrix(data=x_train, label=y_train)
testDM <- xgb.DMatrix(data=x_test, label=y_test)
str(trainDM)
## Class 'xgb.DMatrix' <externalptr>
## - attr(*, ".Dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:10] "Amt" "OldBalanceOrig" "NewBalanceOrig" "OldBalanceDest" ...
#setting the model parameters
param <- list(objective="binary:logistic",
eval_metric="error",
booster="gbtree",
max_depth=2)
set.seed(12)
#cross-validation with xgboost
xgb_cv <- xgb.cv(params = param,
data = trainDM,
nrounds = 20,
nfold = 10,
metrics = 'error')
## [1] train-error:0.002350+0.000031 test-error:0.002354+0.000269
## [2] train-error:0.002302+0.000048 test-error:0.002307+0.000307
## [3] train-error:0.002133+0.000044 test-error:0.002164+0.000276
## [4] train-error:0.002080+0.000070 test-error:0.002117+0.000234
## [5] train-error:0.001421+0.000572 test-error:0.001468+0.000677
## [6] train-error:0.000539+0.000659 test-error:0.000580+0.000800
## [7] train-error:0.000422+0.000432 test-error:0.000422+0.000482
## [8] train-error:0.000157+0.000250 test-error:0.000171+0.000298
## [9] train-error:0.000073+0.000006 test-error:0.000071+0.000039
## [10] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [11] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [12] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [13] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [14] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [15] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [16] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [17] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [18] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [19] train-error:0.000074+0.000004 test-error:0.000074+0.000040
## [20] train-error:0.000074+0.000004 test-error:0.000074+0.000040
Here we can see the train and test error rates at each iteration; from about the 10th iteration onwards the error stays essentially constant.
xgb_cv$evaluation_log %>%
ggplot(aes(x=iter, y=train_error_mean, color='blue')) +
geom_line()+
geom_line(aes(y=test_error_mean, color="red"))+
scale_color_manual(labels=c("Train error", "Test error"), values=c('blue', 'red'))
The optimal number of rounds suggested by the cross-validation appears to be around 10, where the error on both the train and validation sets reaches its minimum; the model also does not seem to be overfitting, as the train and validation errors follow the same trend.
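Rather than reading the stopping point off the curve, xgboost can stop automatically once the held-out error stops improving. A minimal sketch (not run above; best_iteration would then hold the chosen number of rounds):
#let cross-validation pick the stopping point via early stopping
set.seed(12)
xgb_cv_es <- xgb.cv(params = param, data = trainDM, nrounds = 50,
nfold = 10, early_stopping_rounds = 5, verbose = 0)
xgb_cv_es$best_iteration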
#fitting the model with a watchlist on the train and test datasets
fit <- xgb.train(params = param, data = trainDM, nrounds = 10,
watchlist = list(train_set = trainDM, test_set = testDM))
## [1] train_set-error:0.002350 test_set-error:0.002220
## [2] train_set-error:0.002344 test_set-error:0.002212
## [3] train_set-error:0.002155 test_set-error:0.002024
## [4] train_set-error:0.002155 test_set-error:0.002039
## [5] train_set-error:0.000941 test_set-error:0.000925
## [6] train_set-error:0.000062 test_set-error:0.000051
## [7] train_set-error:0.000075 test_set-error:0.000051
## [8] train_set-error:0.000075 test_set-error:0.000051
## [9] train_set-error:0.000075 test_set-error:0.000051
## [10] train_set-error:0.000075 test_set-error:0.000051
fit$evaluation_log %>%
ggplot(aes(x=iter, y=train_set_error, color='blue')) +
geom_line()+
geom_line(aes(y=test_set_error, color="red"))+
scale_color_manual(labels=c("Train error", "Test error"), values=c('blue', 'red'))
The graph shows the train and test errors at each iteration. The error starts relatively high for both datasets, then drops sharply and stabilises. Note also that the test error stays at or below the train error throughout the iterations, which indicates no overfitting.
#predict train data
pred_xgb_train <- predict(fit, trainDM)
train_pred_class <- ifelse(pred_xgb_train>0.5,1,0)
confusionMatrix(as.factor(y_train), as.factor(train_pred_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 321260 0
## 1 24 791
##
## Accuracy : 0.9999
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9975
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
##
## Mcnemar's Test P-Value : 2.668e-06
##
## Sensitivity : 0.9999
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9706
## Prevalence : 0.9975
## Detection Rate : 0.9975
## Detection Prevalence : 0.9975
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
This confusion matrix is based on the train data, which the model has seen.
321,260 legitimate records were correctly classified, and 791 fraud records were correctly identified.
However, 24 fraud records were predicted as legitimate (again, the rows labelled 'Prediction' are the actual classes, since the actual labels were passed first). No legitimate transaction was misclassified on the training data.
#predict test data
pred_xgb_test <- predict(fit, testDM)
test_pred_class <- ifelse(pred_xgb_test>0.5,1,0)
confusionMatrix(as.factor(y_test), as.factor(test_pred_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 137991 1
## 1 6 321
##
## Accuracy : 0.9999
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9977
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9892
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 1.0000
## Specificity : 0.9969
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9817
## Prevalence : 0.9977
## Detection Rate : 0.9976
## Detection Prevalence : 0.9976
## Balanced Accuracy : 0.9984
##
## 'Positive' Class : 0
##
This confusion matrix is based on the test data, which the model had not seen.
137,991 legitimate records were correctly classified, and 321 fraud records were correctly identified.
However, 6 fraud records were predicted as legitimate and 1 legitimate record was predicted as fraud, both misclassification errors.
With the Extreme Gradient Boosting algorithm too, the model achieves approximately 99.9% accuracy on the test data.
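Since accuracy is dominated by the 99.8% legitimate class, a threshold-free metric gives a fairer picture of performance. A sketch using the Metrics package loaded earlier (output not shown):
#AUC on the held-out set uses the raw predicted probabilities,
#so it does not depend on the 0.5 cutoff chosen above
Metrics::auc(actual = y_test, predicted = pred_xgb_test)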
#variable importance from the xgboost model
xgb_impvar <- xgb.importance(feature_names = names(train[,-6]),model = fit)
xgb.ggplot.importance(xgb_impvar)
In this model as well, NewBalanceOrig is the most important variable, contributing the most to the prediction of fraudulent records.