We are using a Synthetic Financial Dataset for Fraud Detection which is a simulator generated dataset of mobile money transactions generated for fraud detection research. The dataset has more than 10 lakh observations with 11 features.

Details of features:-

Step - maps a unit of time in the real world. In this case 1 step is 1 hour of time.

Type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

Amt - amount of the transaction in local currency.

NameOrig - customer who started the transaction

OldBalanceOrg - initial balance before the transactions

NewBalanceOrig - new balance after the transaction

NameDest - customer who is the recipient of the transaction

OldBalanceDest - initial balance of recipient before the transaction.

NewBalanceDest - new balance of recipient after the transaction.

isFraud - This is the transactions made by the fraudulent agents inside the simulation.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction

#libraries used
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(caTools)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rpart)
library(rpart.plot)
library(Matrix)
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(magrittr)
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 3.6.2
library(Ckmeans.1d.dp)
#Import data into r environment
rawdata <- read.csv('C:/Myfolder/SyntheticData/SynData.csv')
#first 10 observations
head(rawdata, 10)
##    Step     Type      Amt    NameOrig OldBalanceOrig NewBalanceOrig    NameDest
## 1     1  PAYMENT  9839.64 C1231006815       170136.0      160296.36 M1979787155
## 2     1  PAYMENT  1864.28 C1666544295        21249.0       19384.72 M2044282225
## 3     1 TRANSFER   181.00 C1305486145          181.0           0.00  C553264065
## 4     1 CASH_OUT   181.00  C840083671          181.0           0.00   C38997010
## 5     1  PAYMENT 11668.14 C2048537720        41554.0       29885.86 M1230701703
## 6     1  PAYMENT  7817.71   C90045638        53860.0       46042.29  M573487274
## 7     1  PAYMENT  7107.77  C154988899       183195.0      176087.23  M408069119
## 8     1  PAYMENT  7861.64 C1912850431       176087.2      168225.59  M633326333
## 9     1  PAYMENT  4024.36 C1265012928         2671.0           0.00 M1176932104
## 10    1    DEBIT  5337.77  C712410124        41720.0       36382.23  C195600860
##    OldBalanceDest NewBalanceDest isFraud isFlaggedFraud
## 1               0           0.00       0              0
## 2               0           0.00       0              0
## 3               0           0.00       1              0
## 4           21182           0.00       1              0
## 5               0           0.00       0              0
## 6               0           0.00       0              0
## 7               0           0.00       0              0
## 8               0           0.00       0              0
## 9               0           0.00       0              0
## 10          41898       40348.79       0              0
#View structure of data
str(rawdata)
## 'data.frame':    1048575 obs. of  11 variables:
##  $ Step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ Amt           : num  9840 1864 181 181 11668 ...
##  $ NameOrig      : Factor w/ 1048317 levels "C1000001725",..: 125165 361320 165237 961663 567916 994158 298038 494238 143414 891803 ...
##  $ OldBalanceOrig: num  170136 21249 181 181 41554 ...
##  $ NewBalanceOrig: num  160296 19385 0 0 29886 ...
##  $ NameDest      : Factor w/ 449635 levels "C1000015936",..: 274918 286717 73551 65465 138002 371072 340846 381943 128187 47318 ...
##  $ OldBalanceDest: num  0 0 0 21182 0 ...
##  $ NewBalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ isFlaggedFraud: int  0 0 0 0 0 0 0 0 0 0 ...
table(rawdata$Step)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
##  2708  1014   552   565   665  1660  6837 21097 37628 35991 37241 36153 37515 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## 41485 44609 42471 43361 49579 51352 40625 19152 12635  6144  3216  1598   440 
##    27    28    29    30    31    32    33    34    35    36    37    38    39 
##    41     4     4     8    12    12 23616 30904 29157 39774 34000 31453 23391 
##    40    41    42    43    44    45    46    47    48    49    50    51    52 
## 34270 36348 41304 45060 38523 18500 12445  8681  5693   764     6    14     8 
##    53    54    55    56    57    58    59    60    61    62    63    64    65 
##    10     4    12    18     6    20    20    12    14     8     8    18    20 
##    66    67    68    69    70    71    72    73    74    75    76    77    78 
##    24     6    18    20     6    20    14     8    22     8     4    10    10 
##    79    80    81    82    83    84    85    86    87    88    89    90    91 
##    10    10    10    10    12    16    14    18     6     8     6    16     8 
##    92    93    94    95 
##    10  4444 10372  2980
#summary of data
summary(rawdata)
##       Step             Type             Amt                  NameOrig      
##  Min.   : 1.00   CASH_IN :227130   Min.   :       0   C1026010130:      2  
##  1st Qu.:15.00   CASH_OUT:373641   1st Qu.:   12149   C1028315484:      2  
##  Median :20.00   DEBIT   :  7178   Median :   76343   C1039623203:      2  
##  Mean   :26.97   PAYMENT :353873   Mean   :  158667   C1041466144:      2  
##  3rd Qu.:39.00   TRANSFER: 86753   3rd Qu.:  213762   C1052471021:      2  
##  Max.   :95.00                     Max.   :10000000   C1056260641:      2  
##                                                       (Other)    :1048563  
##  OldBalanceOrig     NewBalanceOrig            NameDest       OldBalanceDest    
##  Min.   :       0   Min.   :       0   C985934102 :     98   Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   C1286084959:     96   1st Qu.:       0  
##  Median :   16002   Median :       0   C1590550415:     89   Median :  126377  
##  Mean   :  874010   Mean   :  893809   C248609774 :     88   Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600   C665576141 :     87   3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000   C2083562754:     86   Max.   :42100000  
##                                        (Other)    :1048031                     
##  NewBalanceDest        isFraud         isFlaggedFraud
##  Min.   :       0   Min.   :0.000000   Min.   :0     
##  1st Qu.:       0   1st Qu.:0.000000   1st Qu.:0     
##  Median :  218260   Median :0.000000   Median :0     
##  Mean   : 1114198   Mean   :0.001089   Mean   :0     
##  3rd Qu.: 1149808   3rd Qu.:0.000000   3rd Qu.:0     
##  Max.   :42200000   Max.   :1.000000   Max.   :0     
## 

Exploratory Data Analysis

#Finding missing values 
colSums(is.na(rawdata))
##           Step           Type            Amt       NameOrig OldBalanceOrig 
##              0              0              0              0              0 
## NewBalanceOrig       NameDest OldBalanceDest NewBalanceDest        isFraud 
##              0              0              0              0              0 
## isFlaggedFraud 
##              0

There are no missing values in the data.

#No. of fraud records
rawdata %>% count(isFraud)
## # A tibble: 2 x 2
##   isFraud       n
##     <int>   <int>
## 1       0 1047433
## 2       1    1142
#percent of fraudulent records
prop.table(table(rawdata$isFraud))*100
## 
##          0          1 
## 99.8910903  0.1089097

There are 1142 fraud records which is approximately 1.1% in the dataset.

#barplot of fraud vs non-fraud
barplot(prop.table(table(rawdata$isFraud)),
        names.arg = c('not fraud', 'fraud'),
        ylab = 'No of Transactions', main="Fraud vs Not Fraud", col='light pink', ylim = c(0,1.0))

#no.of transactions per Type
rawdata %>% count(Type) %>% arrange(desc(n))
## # A tibble: 5 x 2
##   Type          n
##   <fct>     <int>
## 1 CASH_OUT 373641
## 2 PAYMENT  353873
## 3 CASH_IN  227130
## 4 TRANSFER  86753
## 5 DEBIT      7178
rawdata %>% ggplot(aes(x=Type, fill = Type)) +
  geom_bar() +
  labs(title = "Transactions as per Type",  x = 'Transaction Type' , y = 'No of transactions' ) +
  theme_classic()

We can see in the graph, more no. of transactions are from CASH_OUT and PAYMENT types in total dataset.

#Finding out the category/type from which more no. of fraud transactions took place
rawdata %>% group_by(Type) %>% summarise(fraud_transactions = sum(isFraud))
## # A tibble: 5 x 2
##   Type     fraud_transactions
##   <fct>                 <int>
## 1 CASH_IN                   0
## 2 CASH_OUT                578
## 3 DEBIT                     0
## 4 PAYMENT                   0
## 5 TRANSFER                564
fraud_trans <- rawdata %>% group_by(Type) %>% summarise(fraud_transactions = sum(isFraud))

ggplot(fraud_trans, aes(x = Type,  y = fraud_transactions)) +
  geom_col(aes(fill = 'Type'), show.legend = FALSE) +
  labs(title = 'Fraud transactions as Per type', x = 'Transcation type', y = 'No of Fraud Transactions') +
  geom_text(aes(label = fraud_transactions), size = 4, hjust =.5) +
  theme_classic()

The above plot shows that all the fraud transactions are of CASH_OUT & TRANSFER types.

#frequency distribution of amount in fradulent transactions
ggplot(rawdata[rawdata$isFraud==1, ], aes(x = Amt,  fill = Amt)) +
  geom_histogram(bins = 40, aes(fill = 'Amt')) +
  labs(title = 'Fraud transaction Amount distribution', y = 'No. of Fraud transacts', x = 'Amount in Dollars')

The graph shows that the distribution is positively skewed. It means most of the fraud transactions are of lesser amount.

#Boxplot
ggplot(rawdata, aes(x = factor(isFraud) ,y = log(Amt), fill = factor(isFraud))) +
  geom_boxplot(show.legend = FALSE) +
  labs(title= 'Amount- Boxplot' , x = 'isFraud') +
  theme_classic()

The above boxplot shows that the money involved in fraud transactions is greater than that of in legitimate transactions.

#Type of transaction used in normal transactions
ggplot(rawdata[rawdata$isFraud==0,], aes(x = Type, y = log(Amt), fill = Type)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title= 'Amount- Boxplot' , x = 'Types') +
  theme_classic()

The above boxplot shows that type Transfer is used in large scale followed by Cash_Out and Cash_In and the least being type Debit in legitimate transactions.

#comparing old origin balance with old destination balance
origBal <- ggplot(rawdata, aes(x = factor(isFraud), y = log1p(OldBalanceOrig), fill = factor(isFraud))) +
            geom_boxplot(show.legend = FALSE) +
            labs(title= 'Old balance in Origin Accounts', x = 'isFraud') +
            theme_classic()

destBal <- ggplot(rawdata, aes(x = factor(isFraud), y = log1p(OldBalanceDest), fill = factor(isFraud))) +
            geom_boxplot(show.legend = FALSE) +
            labs(title= 'Old balance in Destination Accounts' , x = 'isFraud') +
            theme_classic()

grid.arrange(origBal, destBal, nrow = 1)

In majority of the fraud transactions, the Old balance of the Origin account from which the payments are made is higher than the rest of the origin accounts while the Old balance in Destination accounts is lower than rest.

#Distribution of transactions at different time intervals
timeInt <- ggplot(rawdata, aes(x = Step)) +
            geom_histogram(bins = 150, aes(fill = 'isFraud'), show.legend = FALSE) +
            labs(title='Total transactions at different time interval', y='No. of transactions') + 
            theme_classic()

fradtimInt <- ggplot(rawdata[rawdata$isFraud==1,], aes(x = Step)) +
               geom_histogram(bins =150, aes(fill = 'isFraud'), show.legend = FALSE) +
               theme_classic() +
               labs(title= 'Fraud transactions at different time interval', y='No.of fraud transactions')

grid.arrange(timeInt, fradtimInt, nrow=2, ncol=1)

After step 50 the total no. of transactions are very low, but the fraud transactions don’t seem to be going down.

To do more research on time interval, converting Step feature into 24-hour format for better understanding.

summary(rawdata$Step)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   20.00   26.97   39.00   95.00
#Creating a new Hour variable using Step
rawdata$Hour <- mod(rawdata$Step, 24)

table(rawdata$Hour)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##  8923  5078  1482   615   581   689  1682  6871 21137 61260 66925 66430 75955 
##    13    14    15    16    17    18    19    20    21    22    23 
## 71543 72964 68014 76767 79735 90923 96426 79176 42116 35458 17825

Now we can see no. of transactions took place at each hour on a 24-hour basis.

#comparing the transactions on a hourly basis between total no. of records and fraud records

timeIntHr <- ggplot(rawdata, aes(x = Hour)) +
              geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
              labs(title='Total transactions at different Hours', y='No. of transactions') +
              theme_classic()

fradtimIntHr <- ggplot(rawdata[rawdata$isFraud==1,], aes(x = Hour)) +
                 geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
                 labs(title= 'Fraud transactions at different Hours', y = 'No. of fraud transactions') +
                 theme_classic()

grid.arrange(timeIntHr, fradtimIntHr, ncol = 1)

Look at the graphs now, the no. of transactions took place between 0 and 9 hours are very low in total dataset whereas in fraud records transactions took place throughout the clock.

#to know whether transaction amount is greater than the available balance
head(rawdata[(rawdata$Amt > rawdata$OldBalanceOrig) &
             (rawdata$NewBalanceDest > rawdata$OldBalanceDest),
             c("Amt","OldBalanceOrig","NewBalanceOrig","OldBalanceDest", "NewBalanceDest", "isFraud")], 10)
##           Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest
## 11    9644.94        4465.00              0          10845      157982.12
## 16  229133.94       15325.00              0           5083       51513.44
## 25  311685.89       10835.00              0           6267     2719172.89
## 49    5346.89           0.00              0         652637     6453430.91
## 73   94253.33       25203.05              0          99773      965870.05
## 82   78766.03           0.00              0         103772      277515.05
## 84  125872.53           0.00              0         348512     3420103.09
## 85  379856.23           0.00              0         900180    19200000.00
## 86 1505626.01           0.00              0          29031     5515763.34
## 89  761507.39           0.00              0        1280036    19200000.00
##    isFraud
## 11       0
## 16       0
## 25       0
## 49       0
## 73       0
## 82       0
## 84       0
## 85       0
## 86       0
## 89       0

The above 10 records show where the transaction Amount was greater than the balance available in the Origin account.

Hence let’s create a new feature called ‘AdjBalOrig’ and ‘AdjBalDest’.

#Creating new features
#AdjBalOrig = NewBalanceOrig + Amt - OldBalanceOrig
#AdjBalDest = OldBalanceDest + Amt - NewBalanceDest

rawdata$AdjBalOrig <- round(rawdata$NewBalanceOrig + rawdata$Amt - rawdata$OldBalanceOrig, 2)

rawdata$AdjBalDest <- round(rawdata$OldBalanceDest + rawdata$Amt - rawdata$NewBalanceDest, 2)

head(rawdata)
##   Step     Type      Amt    NameOrig OldBalanceOrig NewBalanceOrig    NameDest
## 1    1  PAYMENT  9839.64 C1231006815         170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295          21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145            181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671            181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720          41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638          53860       46042.29  M573487274
##   OldBalanceDest NewBalanceDest isFraud isFlaggedFraud Hour AdjBalOrig
## 1              0              0       0              0    1          0
## 2              0              0       0              0    1          0
## 3              0              0       1              0    1          0
## 4          21182              0       1              0    1          0
## 5              0              0       0              0    1          0
## 6              0              0       0              0    1          0
##   AdjBalDest
## 1    9839.64
## 2    1864.28
## 3     181.00
## 4   21363.00
## 5   11668.14
## 6    7817.71

As we have seen above, fraud transactions occurred only with Cash_Out & Transfer types, we would extract only these two-type records for modelling.

#creating new dataset with filtering and subsetting
rawdata.filt <- rawdata %>% select(-c('Step','NameOrig', 'NameDest', 'isFlaggedFraud')) %>%
                    filter(Type %in% c('CASH_OUT', 'TRANSFER'))

table(rawdata.filt$Type)
## 
##  CASH_IN CASH_OUT    DEBIT  PAYMENT TRANSFER 
##        0   373641        0        0    86753

Extracted records which have Cash_Out and Transfer type transactions only.

#one-hot encoding the Type variable
rawdata.filt.dum <- dummy_cols(rawdata.filt)

str(rawdata.filt.dum)
## 'data.frame':    460394 obs. of  15 variables:
##  $ Type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 5 2 2 5 5 2 2 2 2 5 ...
##  $ Amt           : num  181 181 229134 215310 311686 ...
##  $ OldBalanceOrig: num  181 181 15325 705 10835 ...
##  $ NewBalanceOrig: num  0 0 0 0 0 ...
##  $ OldBalanceDest: num  0 21182 5083 22425 6267 ...
##  $ NewBalanceDest: num  0 0 51513 0 2719173 ...
##  $ isFraud       : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ Hour          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ AdjBalOrig    : num  0 0 213809 214605 300851 ...
##  $ AdjBalDest    : num  181 21363 182704 237735 -2401220 ...
##  $ Type_CASH_IN  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Type_CASH_OUT : int  0 1 1 0 0 1 1 1 1 0 ...
##  $ Type_DEBIT    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Type_PAYMENT  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Type_TRANSFER : int  1 0 0 1 1 0 0 0 0 1 ...

Created dummy variables for Type variables.

#Dropping unnecessary features for further analysis
rawdata.filt.dum <- rawdata.filt.dum %>%
  select(-c('Type','Type_CASH_IN', 'Type_DEBIT', 'Type_PAYMENT'))

str(rawdata.filt.dum)
## 'data.frame':    460394 obs. of  11 variables:
##  $ Amt           : num  181 181 229134 215310 311686 ...
##  $ OldBalanceOrig: num  181 181 15325 705 10835 ...
##  $ NewBalanceOrig: num  0 0 0 0 0 ...
##  $ OldBalanceDest: num  0 21182 5083 22425 6267 ...
##  $ NewBalanceDest: num  0 0 51513 0 2719173 ...
##  $ isFraud       : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ Hour          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ AdjBalOrig    : num  0 0 213809 214605 300851 ...
##  $ AdjBalDest    : num  181 21363 182704 237735 -2401220 ...
##  $ Type_CASH_OUT : int  0 1 1 0 0 1 1 1 1 0 ...
##  $ Type_TRANSFER : int  1 0 0 1 1 0 0 0 0 1 ...
#convert isFraud variable into factor variable for developing model
rawdata.filt.dum$isFraud <- as.factor(rawdata.filt.dum$isFraud)

str(rawdata.filt.dum)
## 'data.frame':    460394 obs. of  11 variables:
##  $ Amt           : num  181 181 229134 215310 311686 ...
##  $ OldBalanceOrig: num  181 181 15325 705 10835 ...
##  $ NewBalanceOrig: num  0 0 0 0 0 ...
##  $ OldBalanceDest: num  0 21182 5083 22425 6267 ...
##  $ NewBalanceDest: num  0 0 51513 0 2719173 ...
##  $ isFraud       : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
##  $ Hour          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ AdjBalOrig    : num  0 0 213809 214605 300851 ...
##  $ AdjBalDest    : num  181 21363 182704 237735 -2401220 ...
##  $ Type_CASH_OUT : int  0 1 1 0 0 1 1 1 1 0 ...
##  $ Type_TRANSFER : int  1 0 0 1 1 0 0 0 0 1 ...

For developing classification model, the dependent variable has to be factor variable.

head(rawdata.filt.dum)
##        Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest isFraud
## 1    181.0         181.00              0              0           0.00       1
## 2    181.0         181.00              0          21182           0.00       1
## 3 229133.9       15325.00              0           5083       51513.44       0
## 4 215310.3         705.00              0          22425           0.00       0
## 5 311685.9       10835.00              0           6267     2719172.89       0
## 6 110414.7       26845.41              0         288800        2415.16       0
##   Hour AdjBalOrig AdjBalDest Type_CASH_OUT Type_TRANSFER
## 1    1        0.0      181.0             0             1
## 2    1        0.0    21363.0             1             0
## 3    1   213808.9   182703.5             1             0
## 4    1   214605.3   237735.3             0             1
## 5    1   300850.9 -2401220.0             0             1
## 6    1    83569.3   396799.5             1             0

Final data after all modifications

Developing a prediction model

#splitting the data into train and test
set.seed(123)
ind <- sample(2, nrow(rawdata.filt.dum), replace = T, prob = c(.7,.3))

train <- rawdata.filt.dum[ind==1,]
test <- rawdata.filt.dum[ind==2,]

dim(train)
## [1] 322075     11
dim(test)
## [1] 138319     11

We are going to use 70% from filtered dataset to train the model and remaining 30% to validate the model.

Train dataset has 322075 records which will be used to develop a prediction model. Test dataset has 138319 records which will be used to test the built model.

#View 10 records of train data
head(train, 10)
##          Amt OldBalanceOrig NewBalanceOrig OldBalanceDest NewBalanceDest
## 1     181.00         181.00           0.00           0.00           0.00
## 3  229133.94       15325.00           0.00        5083.00       51513.44
## 6  110414.71       26845.41           0.00      288800.00        2415.16
## 7   56953.90        1942.02           0.00       70253.00       64106.18
## 9   23261.30       20411.53           0.00       25742.00           0.00
## 10  62610.80       79114.00       16503.20         517.00        8383.29
## 12  47458.86      209534.84      162075.98       52120.00           0.00
## 13 136872.92      162075.98       25203.05      217806.00           0.00
## 14  94253.33       25203.05           0.00       99773.00      965870.05
## 15  42712.39       10363.39           0.00       57901.66       24044.18
##    isFraud Hour AdjBalOrig AdjBalDest Type_CASH_OUT Type_TRANSFER
## 1        1    1       0.00     181.00             0             1
## 3        0    1  213808.94  182703.50             1             0
## 6        0    1   83569.30  396799.55             1             0
## 7        0    1   55011.88   63100.72             1             0
## 9        0    1    2849.77   49003.30             1             0
## 10       0    1       0.00   54744.51             0             1
## 12       0    1       0.00   99578.86             1             0
## 13       0    1      -0.01  354678.92             1             0
## 14       0    1   69050.28 -771843.72             1             0
## 15       0    1   32349.00   76569.87             0             1

Extracting first 10 records of train dataset so as to know how the data looks like after all modifications.

#Developing a model on train data with Random Forest algorithm
library(randomForest)
fit.forest <- randomForest(isFraud ~ ., data = train, ntree=50)

randomForest() by default develops 500 trees, to avoid consuming more time tried with 50 trees

#Print the model
print(fit.forest)
## 
## Call:
##  randomForest(formula = isFraud ~ ., data = train, ntree = 50) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.01%
## Confusion matrix:
##        0   1 class.error
## 0 321260   0  0.00000000
## 1     22 793  0.02699387

Here we could see no. of trees as 50 that the model has used, number variables used at each split, OOB error rate of 0.01%, which means with this training data set model achieved about 99% accuracy.

#prediction on training data
p1 <- predict(fit.forest, train)
#Confusion Matrix on train data
library(caret)
confusionMatrix(train$isFraud, p1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 321260      0
##          1      3    812
##                                 
##                Accuracy : 1     
##                  95% CI : (1, 1)
##     No Information Rate : 0.9975
##     P-Value [Acc > NIR] : <2e-16
##                                 
##                   Kappa : 0.9982
##                                 
##  Mcnemar's Test P-Value : 0.2482
##                                 
##             Sensitivity : 1.0000
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 0.9963
##              Prevalence : 0.9975
##          Detection Rate : 0.9975
##    Detection Prevalence : 0.9975
##       Balanced Accuracy : 1.0000
##                                 
##        'Positive' Class : 0     
## 

Results of confusioin matrix show that 321260 were actually belonged to class legitimate transactions and model also predicted them to be class normal transactions which is correct classification; and about 812 transactions were actually belonged to fraudulent transactions and model also predicted them to be fraud records, which is also correct classification.

But there were 3 records which were actually belonged to class normal transactions but the model predicted them to be fraudulent transactions, which is missclassification error.

And for fraudulent records the above model predicted 100% accurate but this was based on training data which records model had already seen.

With the above confusion matrix on training dataset, the model achieved almost 100% accuracy. But need to confirm this with the unseen data.

#prediction on test data
p2 <- predict(fit.forest, test)
#confusion matrix on test data
confusionMatrix(test$isFraud, p2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 137991      1
##          1      6    321
##                                      
##                Accuracy : 0.9999     
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9977     
##     P-Value [Acc > NIR] : <2e-16     
##                                      
##                   Kappa : 0.9892     
##                                      
##  Mcnemar's Test P-Value : 0.1306     
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 0.9969     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 0.9817     
##              Prevalence : 0.9977     
##          Detection Rate : 0.9976     
##    Detection Prevalence : 0.9976     
##       Balanced Accuracy : 0.9984     
##                                      
##        'Positive' Class : 0          
## 

This confusion matrix is based on test data which is the data that model had not seen.

Based on the above results, there were about 137991 records which were belonged to normal transaction class and model also predicted them to be normal records which are correct classification. And about 321 records which were belonged to fraudulent transactions and model also predicted them to be fraud records which are correct classification.

But about 6 records which were actually belonged to class normal transactions but the model predicted them to be fraud records which is incorrect prediction. There is only one record in the prediction which was actually belonged to fraudulent transaction class, but the model predicted it as normal transaction, which is also misclassification error.

In all, the above model achieved about 99.9% accuracy on the test data which is pretty good but it’s not the case in real time world.

#Tuning parameters
#optimal no. of trees
plot(fit.forest)

The above plot of the model shows that as the number of trees getting increased, there is a gradual drop in the error rate. And approximately with 35 trees the error got more or less stabilized.

As the above random forest model was built using with 50 trees, we can tune it to 35 trees would be sufficient.

#optimal mtry value
t <- tuneRF(train[,-6], train[,6],
       stepFactor = .5,
       plot = TRUE,
       trace = TRUE,
       ntreeTry = 35,
       improve = 0.05)
## mtry = 3  OOB error = 0.01% 
## Searching left ...
## mtry = 6     OOB error = 0.01% 
## 0.1538462 0.05
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## invalid mtry: reset to within valid range
## mtry = 12    OOB error = 0.01% 
## -0.5909091 0.05 
## Searching right ...
## mtry = 1     OOB error = 0.19% 
## -26.31818 0.05

The above tuning parameter is for mtry, the above plot shows mtry=3 would be sufficient enough to achieve optimal accuracy.

The above random forest was built with mtry=3 as it used the default value which is square root of features.

We are not tuning any parameter in the above model as the model achieved 99.9% accuracy on test data.

hist(treesize(fit.forest),
     main = "No. of Nodes for the Trees",
     col = "skyblue")

#feature importance
fit.forest$importance
##                MeanDecreaseGini
## Amt                   145.43907
## OldBalanceOrig        101.43167
## NewBalanceOrig        717.41427
## OldBalanceDest         92.48587
## NewBalanceDest         97.82512
## Hour                   41.02312
## AdjBalOrig            184.37075
## AdjBalDest             80.23465
## Type_CASH_OUT          82.91423
## Type_TRANSFER          83.10585
varImpPlot(fit.forest)

importance_matrix <- data.frame(Variables = rownames(fit.forest$importance), fit.forest$importance, row.names = NULL)

ggplot(importance_matrix, aes(y = MeanDecreaseGini , x = Variables, fill = Variables))+
  geom_col() + coord_flip() +
  labs(title= 'Variiable importance plot')+
  theme_classic()

With the above model, the feature NewBalanceOrig which is ‘new balance after transaction’ in the given data is playing a significant role in predicting the fraudulent records with high mean decrease gini index value at 721.09, followed by AdjBalOrg, Amt and so on.

With the above model, we achieved approximately 99.9% accuracy on test data by using Fandom Forest algorithm.

Building another model on the same data using Extreme Gradient Boosting algorithm.

#Extreme Gradient Boosting algorithm
library(xgboost)
library(Matrix)
#splitting the modified dataset into train and test
set.seed(123)
ind <- sample(2, nrow(rawdata.filt.dum), replace = T, prob = c(.7,.3))

train <- rawdata.filt.dum[ind==1,]
test <- rawdata.filt.dum[ind==2,]

dim(train)
## [1] 322075     11
dim(test)
## [1] 138319     11
#converting independent features of train and test matrices into datamatrix format
x_train <- as.matrix(train[,-6]) 
y_train <- as.numeric(as.character(train$isFraud))

x_test <- as.matrix(test[,-6])
y_test <- as.numeric(as.character(test$isFraud))

trainDM <- xgb.DMatrix(data=x_train, label=y_train)
testDM  <- xgb.DMatrix(data=x_test, label=y_test)

str(trainDM)
## Class 'xgb.DMatrix' <externalptr> 
##  - attr(*, ".Dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:10] "Amt" "OldBalanceOrig" "NewBalanceOrig" "OldBalanceDest" ...
#setting the parameter grid
param <- list(objective="binary:logistic",
              eval_metric="error",
              booster="gbtree",
              max_depth=2)
set.seed(12)

#extreme gradient cross-validation
xgb_cv <- xgb.cv(params = param,
                 data = trainDM,
                 nrounds = 20,
                 nfold = 10,
                 metrics = 'error')
## [1]  train-error:0.002350+0.000031   test-error:0.002354+0.000269 
## [2]  train-error:0.002302+0.000048   test-error:0.002307+0.000307 
## [3]  train-error:0.002133+0.000044   test-error:0.002164+0.000276 
## [4]  train-error:0.002080+0.000070   test-error:0.002117+0.000234 
## [5]  train-error:0.001421+0.000572   test-error:0.001468+0.000677 
## [6]  train-error:0.000539+0.000659   test-error:0.000580+0.000800 
## [7]  train-error:0.000422+0.000432   test-error:0.000422+0.000482 
## [8]  train-error:0.000157+0.000250   test-error:0.000171+0.000298 
## [9]  train-error:0.000073+0.000006   test-error:0.000071+0.000039 
## [10] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [11] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [12] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [13] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [14] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [15] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [16] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [17] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [18] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [19] train-error:0.000074+0.000004   test-error:0.000074+0.000040 
## [20] train-error:0.000074+0.000004   test-error:0.000074+0.000040

Here We can see the train and test error rates at each iteration. And from 10th iteration onwards the error got more or less constant.

xgb_cv$evaluation_log %>%
  ggplot(aes(x=iter, y=train_error_mean, color='blue')) +
  geom_line()+
  geom_line(aes(y=test_error_mean, color="red"))+
  scale_color_manual(labels=c("Train error", "Test error"), values=c('blue', 'red'))

The optimal no. of rounds given by the xgboost seems to be 11 Rounds where the error on both train and validation set is minimum, also the model doesn’t seems to be overfitting, as both the train and validation error follows the same trend.

#fitting the model with watchlist on train and test datasets
fit <- xgb.train(trainDM, params = param, nrounds = 10,
                 watchlist = list(train_set=trainDM, test_set=testDM))
## [1]  train_set-error:0.002350    test_set-error:0.002220 
## [2]  train_set-error:0.002344    test_set-error:0.002212 
## [3]  train_set-error:0.002155    test_set-error:0.002024 
## [4]  train_set-error:0.002155    test_set-error:0.002039 
## [5]  train_set-error:0.000941    test_set-error:0.000925 
## [6]  train_set-error:0.000062    test_set-error:0.000051 
## [7]  train_set-error:0.000075    test_set-error:0.000051 
## [8]  train_set-error:0.000075    test_set-error:0.000051 
## [9]  train_set-error:0.000075    test_set-error:0.000051 
## [10] train_set-error:0.000075    test_set-error:0.000051
fit$evaluation_log %>%
  ggplot(aes(x=iter, y=train_set_error, color='blue')) +
  geom_line()+
  geom_line(aes(y=test_set_error, color="red"))+
  scale_color_manual(labels=c("Train error", "Test error"), values=c('blue', 'red'))

This above graph shows train and test errors at each iteration. Initially the error was a bit high for both train and test datasets, but then gradually came down and got more or less stabilized. Another thing to be noted is the test error is less than train error throughout the iterations, which means there is no presence of overfitting.

#predict train data
pred_xgb_train <- predict(fit, trainDM)

train_pred_class <- ifelse(pred_xgb_train>0.5,1,0)

confusionMatrix(as.factor(y_train), as.factor(train_pred_class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 321260      0
##          1     24    791
##                                      
##                Accuracy : 0.9999     
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9975     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.985      
##                                      
##  Mcnemar's Test P-Value : 2.668e-06  
##                                      
##             Sensitivity : 0.9999     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 0.9706     
##              Prevalence : 0.9975     
##          Detection Rate : 0.9975     
##    Detection Prevalence : 0.9975     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

The above confusion matrix is based on train data which is the data that model had seen.

Based on the above results, there were about 321260 records which were belonged to normal transaction class and model also predicted them to be normal records which are correct classification. And about 791 records which were belonged to fraudulent transactions and model also predicted them to be fraud records which are correct classification.

But there were 24 records which were actually belonged to class normal transactions but the model predicted them to be fraud records which is incorrect prediction. For fraudulent records, this model had predicted 100% accurately with the train data.

#predict test data
pred_xgb_test <- predict(fit, testDM)

test_pred_class <- ifelse(pred_xgb_test>0.5,1,0)

confusionMatrix(as.factor(y_test), as.factor(test_pred_class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 137991      1
##          1      6    321
##                                      
##                Accuracy : 0.9999     
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9977     
##     P-Value [Acc > NIR] : <2e-16     
##                                      
##                   Kappa : 0.9892     
##                                      
##  Mcnemar's Test P-Value : 0.1306     
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 0.9969     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 0.9817     
##              Prevalence : 0.9977     
##          Detection Rate : 0.9976     
##    Detection Prevalence : 0.9976     
##       Balanced Accuracy : 0.9984     
##                                      
##        'Positive' Class : 0          
## 

This confusion matrix is based on test data which is the data that model had not seen.

Based on the above results, there were about 137991 records which were belonged to normal transaction class and model also predicted them to be normal records which are correct classification. And about 321 records which were belonged to fraudulent transactions and model also predicted them to be fraud records which are correct classification.

But about 6 records which were actually belonged to class normal transactions but the model predicted them to be fraud records which is incorrect prediction. There is only one record in the prediction which was actually belonged to fraudulent transaction class, but the model predicted it as normal transaction, which is also misclassification error.

Also with Extreme Gradient Boosting algorithm, the model achieved approximately 99.9% accuracy on test data.

#variable importance per xgboost algorithm
xgb_impvar <- xgb.importance(feature_names = names(train[,-6]),model = fit)

xgb.ggplot.importance(xgb_impvar)

With this model, the feature NewBalanceOrig is the most important variable which is contributing significantly in predicting the fraudulent records.