library(knitr)

===============================================

1. About Data

This is a data set of digital transactions and each transaction is clearly labeled as Fraud or Non-Fraud as a ground truth.
The data set that we are using is not a synthetic data set instead it’s a real-world data set generated by Capital One and very much has the same kind of distribution and attributes that they work on their day-to-day basis.
Data Set has 29 attributes which are as follows:
Account Number
CustomerId
CreditLimit
AvailableMoney
TransactionDateTime
TransactionAmount
MerchantName
AcqCountry
MerchantCountryCode
PosEntryMode
PosConditionCode
MerchantCategoryCode
CurrentExpDate
AccountOpenDate
DateOfLastAddressChange
CardCVV
EnteredCVV
CardLast4Digit
TransactionType
EchoBuffer
CurrentBalance
MerchantCity
MerchantState
MerchantZip
CardPresent
PosOnPremises
RecurringAuthInd
ExpirationDateKeyInMatch
isFraud
The Data set has 786363 entries, which is a fair amount of data to model frauds and non-frauds after comprehensive analysis and visualizations.
Not all the columns are filled pre-hand, which very much represents a real-world dataset and it is important to ask questions about those column attributes to the business teams later on.
But what can we still do to come up with a baseline and a state-of-the-art performant model on whatever we have, is the target of the project or any real-world data science pipeline.
The nature of fraud and the non-fraud dataset is inherently imbalanced, which we can also see in the given dataset which shows only 1.6% of the time fraud has happened or reported.
So what can we do to mitigate the fraud after doing a rigorous analysis to predict fraud in various case scenarios?

Structure of Data:

Displaying Top 5 rows of the dataset.

head(df, n = 5)

Description and Basic Summary of the Dataset, which explains the type of columns present, their datatypes, total number of entries in that column

summary(df)

##  accountNumber       customerId         creditLimit   
##  Length:786363      Length:786363      Min.   :  250  
##  Class :character   Class :character   1st Qu.: 5000  
##  Mode  :character   Mode  :character   Median : 7500  
##                                        Mean   :10759  
##                                        3rd Qu.:15000  
##                                        Max.   :50000  
##  availableMoney  transactionDateTime transactionAmount
##  Min.   :-1006   Length:786363       Min.   :   0.00  
##  1st Qu.: 1077   Class :character    1st Qu.:  33.65  
##  Median : 3185   Mode  :character    Median :  87.90  
##  Mean   : 6251                       Mean   : 136.99  
##  3rd Qu.: 7500                       3rd Qu.: 191.48  
##  Max.   :50000                       Max.   :2011.54  
##  merchantName        acqCountry        merchantCountryCode
##  Length:786363      Length:786363      Length:786363      
##  Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   
##                                                           
##                                                           
##                                                           
##  posEntryMode       posConditionCode   merchantCategoryCode
##  Length:786363      Length:786363      Length:786363       
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##  currentExpDate     accountOpenDate   
##  Length:786363      Length:786363     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##  dateOfLastAddressChange   cardCVV         
##  Length:786363           Length:786363     
##  Class :character        Class :character  
##  Mode  :character        Mode  :character  
##                                            
##                                            
##                                            
##   enteredCVV        cardLast4Digits    transactionType   
##  Length:786363      Length:786363      Length:786363     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   echoBuffer        currentBalance    merchantCity      
##  Length:786363      Min.   :    0.0   Length:786363     
##  Class :character   1st Qu.:  689.9   Class :character  
##  Mode  :character   Median : 2451.8   Mode  :character  
##                     Mean   : 4508.7                     
##                     3rd Qu.: 5291.1                     
##                     Max.   :47498.8                     
##  merchantState      merchantZip        cardPresent    
##  Length:786363      Length:786363      Mode :logical  
##  Class :character   Class :character   FALSE:433495   
##  Mode  :character   Mode  :character   TRUE :352868   
##                                                       
##                                                       
##                                                       
##  posOnPremises      recurringAuthInd  
##  Length:786363      Length:786363     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##  expirationDateKeyInMatch  isFraud       
##  Mode :logical            Mode :logical  
##  FALSE:785320             FALSE:773946   
##  TRUE :1043               TRUE :12417    
##                                          
##                                          
##

Displaying All the columns in the data at a glance. Columns Names:

colnames(df)

##  [1] "accountNumber"            "customerId"              
##  [3] "creditLimit"              "availableMoney"          
##  [5] "transactionDateTime"      "transactionAmount"       
##  [7] "merchantName"             "acqCountry"              
##  [9] "merchantCountryCode"      "posEntryMode"            
## [11] "posConditionCode"         "merchantCategoryCode"    
## [13] "currentExpDate"           "accountOpenDate"         
## [15] "dateOfLastAddressChange"  "cardCVV"                 
## [17] "enteredCVV"               "cardLast4Digits"         
## [19] "transactionType"          "echoBuffer"              
## [21] "currentBalance"           "merchantCity"            
## [23] "merchantState"            "merchantZip"             
## [25] "cardPresent"              "posOnPremises"           
## [27] "recurringAuthInd"         "expirationDateKeyInMatch"
## [29] "isFraud"

Dimension of dataset:

dim(df)

## [1] 786363     29

Number of Records = 786363 Number of Attributes/Columns = 29

Null values in data:

Which columns have NA values and further we’ll fill it with appropriate numbers in each column, after the analysis.

df[df == ""] <- NA                     
colSums(is.na(df))

##            accountNumber               customerId 
##                        0                        0 
##              creditLimit           availableMoney 
##                        0                        0 
##      transactionDateTime        transactionAmount 
##                        0                        0 
##             merchantName               acqCountry 
##                        0                     4562 
##      merchantCountryCode             posEntryMode 
##                      724                     4054 
##         posConditionCode     merchantCategoryCode 
##                      409                        0 
##           currentExpDate          accountOpenDate 
##                        0                        0 
##  dateOfLastAddressChange                  cardCVV 
##                        0                        0 
##               enteredCVV          cardLast4Digits 
##                        0                        0 
##          transactionType               echoBuffer 
##                      698                   786363 
##           currentBalance             merchantCity 
##                        0                   786363 
##            merchantState              merchantZip 
##                   786363                   786363 
##              cardPresent            posOnPremises 
##                        0                   786363 
##         recurringAuthInd expirationDateKeyInMatch 
##                   786363                        0 
##                  isFraud 
##                        0

Unique values in data:

Unique values in each attribute also helps in exploring the structure of the dataset.
When an attribute has less number of Unique Values it is always a good criteria to look at them keenly to know more about the structure as a whole and how the other attributes are behaving around these small buckets.

require(dplyr)
sapply(df, n_distinct)

##            accountNumber               customerId 
##                     5000                     5000 
##              creditLimit           availableMoney 
##                       10                   521916 
##      transactionDateTime        transactionAmount 
##                   776637                    66038 
##             merchantName               acqCountry 
##                     2490                        5 
##      merchantCountryCode             posEntryMode 
##                        5                        6 
##         posConditionCode     merchantCategoryCode 
##                        4                       19 
##           currentExpDate          accountOpenDate 
##                      165                     1820 
##  dateOfLastAddressChange                  cardCVV 
##                     2184                      899 
##               enteredCVV          cardLast4Digits 
##                      976                     5246 
##          transactionType               echoBuffer 
##                        4                        1 
##           currentBalance             merchantCity 
##                   487318                        1 
##            merchantState              merchantZip 
##                        1                        1 
##              cardPresent            posOnPremises 
##                        2                        1 
##         recurringAuthInd expirationDateKeyInMatch 
##                        1                        2 
##                  isFraud 
##                        2

Response attribute “isFraud” :

The most important attribute of the given dataset, which can be a label in a classification problem.

require(dplyr)
dplyr::count(df, isFraud, sort = TRUE)

isFraud attribute has 1.6% True Values.

2. Questions

Evaluating the distribution of the various attributes in the dataset, whether they are numerical or categorical, mapping into optimal buckets.
What types of Merchants exist in the dataset, online, offline, etc., and which merchants should be prioritized from a business point of view.
What kind of transactions is happening in the given data, and which is the dominant type?
If the card Present during the transaction has any causation or correlation with the fraud.
Deep Dive into the numerical attributes of the card, which may have a direct impact and can be of high importance, e.g., credit card limit, available balance, etc.
The important Merchant names where most of the fraud happens is a must question to address.
Which customers are experiencing most fraud cases and where are they doing most of their transactions, to take important actions.
Correlation between numerical attributes of the credit card, important to question the causation with fraud in the end.
The optimal way to remove empty columns (either consult the business teams or perform whatever is in hand and later on deep dive into the business domain that if those features are important from a business perspective and how can we do data acquisition for such attributes.)
Feature Engineering and Data Transformation to add more sense to data and may enhance modeling performance.
Analysis to see how many transactions can be labeled as Duplicate Transactions as it can be Multi-Swiped Transactions which may have a high chance of fraud.
What are the important attributes, to identify the duplicate transactions and check the percentage of these transactions?
Identifying the outliers in the attributes and optimal reasoning to remove those for better generalization, may help in modeling distribution to not be highly skewed.
How different time slots of the day can be generated from the existing Date Time attribute and how these slots can help enhance the performance.
Why does removing unique identifiers make sense in data modeling, eg., IDs of customers, etc.?
The optimal way to fill the null values, based on the distribution of the data or mean/median/mode.
Treating the highly imbalanced dataset with Under-sampling and over-sampling
Baseline Modeling: Logistic Regression, K-Nearest Neighbor
Evaluating results on various metrics: Accuracy, F1 Score, Precision, Recall, Confidence Interval, Significance of P-value, ROC, and AUC curves.
Interpretability of features: Decision Tree
Feature Importance: Random Forest
High Performant Model: XGBoost
Principal Component Analysis for addressing the predictability and how much variance can be explained with a few PCA components.
Summary of Results and Impact.
Future Deep Diving in Real-world scenarios.

3. Data Visualization and Answers to Questions:

This and follwing sections will try to answer the questions mentioned above with various graphs that may include: Plots, Histograms, Box Plots, Correlation Matrix and Hypothesis about the Structure of Data.

Converting True/False Booleans as 0s and 1s. As numerical values are meaningful to deal with.

df$isFraud <- as.integer(df$isFraud)
df$cardPresent <- as.integer(df$cardPresent)
df$expirationDateKeyInMatch <- as.integer(df$expirationDateKeyInMatch)

Buckets/Unique Values in Credit Card Limit Column are shown below:

Most of the Credit Cards have 5000 as their limit amount, on second number it is 15000.
Interestingly, credit cards with 50,000 limit also exists in the dataset*

library(RColorBrewer)
e = count(df, creditLimit)
e <- e[complete.cases(e), ]
coul <- brewer.pal(8, "Set2") 
par(mar  = c(4.1, 4 ,2.1 ,0))
barplot(height= e$n, names= e$creditLimit, xlab = 'creditLimit' , ylab = 'Frequency', col=coul)

Question:

4 Types of Merchant Country Code exists in the dataset, where US is a dominant entity as shown.

library(RColorBrewer)
coul <- brewer.pal(8, "Set2") 
e = count(df, merchantCountryCode)
e <- e[complete.cases(e), ]
par(mar  = c(4.1, 4 ,2.1 ,0))
barplot(height= e$n, names= e$merchantCountryCode, xlab = 'merchantCountryCode' , ylab = 'Frequency', col=coul, horiz = TRUE)

5 POS Entery Mode exists where “05”, “09”, and “02” covers most of the values.

library(plotrix)
e = count(df, posEntryMode)
e <- e[complete.cases(e), ]
pie3D(e$n,labels = e$posEntryMode,explode = 0.1, main = "Pie Chart of posEntryMode ", mar = rep(1, 4),col = hcl.colors(length(e$n), "Spectral"))

3 POS Condition Mode exists where “01” is dominant among all.

library(RColorBrewer)
coul <- brewer.pal(8, "Set2") 
e = count(df, posConditionCode)
e <- e[complete.cases(e), ]
par(mar  = c(4.1, 4 ,1.1 ,0))
barplot(height= e$n, names= e$posConditionCode, xlab = 'posConditionCode' , ylab = 'Frequency', col=coul)

Question:

There are 3 Types of Transaction types:

1. Purchase
1. Reversal
1. Address Verification
Most values lie among Purchase Type of transaction. But it is necessary to explore the other two types too, to analyze what is happening in those transactions.

library(lessR)
merchant_with_fraud1 = subset(df, transactionType != is.na(transactionType), select = transactionType)
cols <- hcl.colors(length(unique(merchant_with_fraud1$transactionType)), "Zissou 1")
PieChart(transactionType,data = merchant_with_fraud1,values="%",
         fill = "viridis",
         main = "Transaction types",
         color = "black",
         lwd = 1.5,
         values_color = c(rep("white", 2), 1),
         values_size = 0.8)

## >>> Suggestions
## PieChart(transactionType, hole=0)  # traditional pie chart
## PieChart(transactionType, values="%")  # display %'s on the chart
## PieChart(transactionType)  # bar chart
## Plot(transactionType)  # bubble plot
## Plot(transactionType, values="count")  # lollipop plot 
## 
## --- transactionType --- 
## 
##        transactinTyp    Count   Prop 
## ------------------------------------- 
## ADDRESS_VERIFICATION    20169   0.026 
##             PURCHASE   745193   0.948 
##             REVERSAL    20303   0.026 
## ------------------------------------- 
##                Total   785665   1.000 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 1337879.797, df = 2, p-value = 0.000

If card is Present during the transaction, it plays an important role and dataset is fairly distributed among the two values as True/False

library(plotrix)
e = count(df, cardPresent)
e <- e[complete.cases(e), ]
pie3D(e$n,labels = e$cardPresent,explode = 0.1, main = "Card Present during transaction ", mar = rep(1.75, 4),col = hcl.colors(length(e$n), "Spectral"))

Expiration Date Key In Match is highly skewed to 0 value

library(RColorBrewer)
coul <- brewer.pal(8, "Set2") 
e = count(df, expirationDateKeyInMatch)
e <- e[complete.cases(e), ]
par(mar  = c(4.1, 4 ,1.1 ,0))
barplot(height= e$n, names= e$expirationDateKeyInMatch, xlab = 'expirationDateKeyInMatch' , ylab = 'Frequency', col=coul)

Count of expirationDateKeyInMatch

count(df, expirationDateKeyInMatch)

Plot showing frequency of Fraud
As described earlier, it is clearly an example of an IMBALANCED DATSET with respect to isFraud attribute. But It is expected to be like that in real world scenario as Frauds are always less than 5% with respect to whole dataset.
Here it is 1.6%
Finding Correlation and Causation aspect with respect to fraud transactions

Question:

List of top merchants

Here we can clearly see which are the Top Merchants:
To know more about the transaction data set it is important to know the Top Merchants where most of the Transactions happen. It helps in Optimizing the metrics and to prioritize the business and data strategies accordingly.
1. Uber
1. Lyft
1. oldnavy.com
1. staples.com
1. alibaba.com
1. apple.com
1. walmart.com
1. cheapfast.com
1. ebay.com
1. target.com

important_merchants = count(df, merchantName)
important_merchants <- filter(important_merchants, n > 10000)
arrange(important_merchants, -n)

Question:

Top Transaction Types, in terms of their name, to understand the behavior of data.

Here we can clearly see that most transactions happen at:
1. Online Retail
1. Food
1. Entertainment
1. Online Gifts
1. Ride Share
1. Hotels

library("ggplot2")
e = count(df, merchantCategoryCode)
e <- e[complete.cases(e), ]
ggplot(e, aes(reorder(merchantCategoryCode, -n), n)) + 
  geom_bar(stat = "identity",color='skyblue',fill='steelblue')+
  theme(axis.text.x = element_text(angle = 90, size = 10))

Question:

Let’s explore Numerical Valued Attributes in Credit Card Transactions these attributes hold high importance because they directly impact the response variable isFraud.

df_num = df[ , c("creditLimit", "availableMoney","transactionAmount", "currentBalance")]
head(df_num, n = 5)

Separatelty describing these numerical attributes to get the essence of data and its distribution.

Important Note:

Here we can see Available Money can also be Negative, which is -1005.63 as minimum value*

summary(df_num)

##   creditLimit    availableMoney  transactionAmount
##  Min.   :  250   Min.   :-1006   Min.   :   0.00  
##  1st Qu.: 5000   1st Qu.: 1077   1st Qu.:  33.65  
##  Median : 7500   Median : 3185   Median :  87.90  
##  Mean   :10759   Mean   : 6251   Mean   : 136.99  
##  3rd Qu.:15000   3rd Qu.: 7500   3rd Qu.: 191.48  
##  Max.   :50000   Max.   :50000   Max.   :2011.54  
##  currentBalance   
##  Min.   :    0.0  
##  1st Qu.:  689.9  
##  Median : 2451.8  
##  Mean   : 4508.7  
##  3rd Qu.: 5291.1  
##  Max.   :47498.8

Through analysis it can be seen that most of the Credit limits are under 10,000.
But some goes till 50,000.
While data Modelling this information can be really useful.

par(mfrow=c(2,2))
hist(df[ , c("creditLimit")], main="creditLimit", xlab="Frequency", ylab="credit Limit", col = "blue")
hist(df[ , c("availableMoney")], main="availableMoney", xlab="Frequency", ylab="availableMoney", col = "blue")
hist(df[ , c("transactionAmount")], main="transactionAmount", xlab="Frequency", ylab="transaction Amount", col = "blue")
hist(df[ , c("currentBalance")], main="currentBalance", xlab="Frequency", ylab="currentBalance", col = "blue")

Available Money mostly lie under the buckets of less than 10,000. It is Right Skewed, which actually makes sense in the amount figures.
Interestingly, Transaction amount mostly falls under 500 and is Right Skewed.

Question:

Box Plots help analyzing the data distribution and beautifully caters Outliers in the datasets and the skewness.

It can explain the quantiles of the distribution.
It shows how these Numerical Atrributes are Right Skewed

par(mfrow=c(2,2))
boxplot(df_num[ , c("creditLimit")], col = "green", main="credit Limit")
boxplot(df_num[ , c("availableMoney")], col = "red", main="available Money")
boxplot(df_num[ , c("transactionAmount")], col = "blue", main="transaction Amount")
boxplot(df_num[ , c("currentBalance")], col = "purple", main="current Balance")

Most Fradulant Merchant Types

Merchant Type with highest fraud transaction

merchant_with_fraud = subset(df, isFraud == '1', select = merchantCategoryCode)
head(merchant_with_fraud, n = 10)

Question:

Top 3 Merchant Categories are:

1. Online Retail
1. Online Gifts
1. Ride Share

library("ggplot2")
e = count(merchant_with_fraud, merchantCategoryCode)
e <- e[complete.cases(e), ]
ggplot(e, aes(reorder(merchantCategoryCode, -n), n)) + 
  geom_bar(stat = "identity",color='skyblue',fill='brown')+
  theme(axis.text.x = element_text(angle = 90, size = 10))

Question:

The Merchant Names where most Fraud happens:

Top 5 are:

1. Lyft
1. ebay.com
1. Fresh Flowers
1. Uber
1. walmart.com

merchant_name__with_fraud = subset(df, isFraud == '1', select = merchantName)
important_merchants = count(merchant_name__with_fraud, merchantName)
important_merchants = filter(important_merchants, n > 300)
arrange(important_merchants, -n)

Question:

The Customers and account IDs which get most frauds can also be analyzed through dataset.

Top 3 Account IDs are:

1. 380680241
1. 782081187
1. 246251253

accountNumber_with_fraud = subset(df, isFraud == '1', select = accountNumber)
important_merchants = count(accountNumber_with_fraud, accountNumber)
important_merchants = filter(important_merchants, n > 200)
arrange(important_merchants, -n)

Important Note: Account IDs and Customer IDs are the same.

customers_with_fraud = subset(df, isFraud == '1', select = customerId)
important_merchants = count(customers_with_fraud, customerId)
important_merchants = filter(important_merchants, n > 200)
arrange(important_merchants, -n)

Correlation Matrix between Numerical Attributes

Tells important information about the inter-correlation.

library(psych)
corPlot(df_num, cex = 1.2)

4. Data Wrangling and further analysis

Reason for removing these 6 columns:

The empty columns are removed because in the previous analysis we could see that 6 attributes are inherently empty by the data provider. It can further be investigated if these attributes hold importance in terms of fraud prediction. But for now, the challenge is to predict fraudulent transactions without getting these attributes, which makes the problem inherently more challenging. It can clearly be translated to a real-world problem where more acquisition of data can be done to perform better.

drops <- c('echoBuffer','merchantCity','merchantState','merchantZip','posOnPremises','recurringAuthInd')
df <- df[ , !(names(df) %in% drops)]

Data frame after dropping mentioned 6 columns

head(df, n = 5)

df_2 <- df

Question:

Analyzing Duplicate, Reveresal and Multi-Swiped Transactions

The assumption is: Duplicate Transactions are any record other than Address Verification because it has Transaction Amount = 0

df_2 <- subset(df_2, transactionType != "ADDRESS_VERIFICATION" | is.na(transactionType))

Adding Date Column in the Data Frame to compute Results independent of just Date-Time

df_2$transactionDate <- as.Date(df_2$transactionDateTime)

head(df_2, n = 5)

Duplicate transaction

Here we are taking those attributes which can uniquely identify and help in identifying a duplicate transaction at a given point:

Question:

The attributes used are:

1. accountNumber (Reason: It will be unique for that customer)
1. transactionAmount (Reason: the amount needs to be checked for duplicated occurrances)
1. merchantName (Reason: where the transaction is actually taking place)
1. acqCountry (Reason: the country where it can take place)
1. accountOpenDate (Reason: account open date will be same)
1. merchantCategoryCode (Reason: Sometimes, data do not come as we expect it to be, where merchantName can come as an empty index, then merchantCategoryCode will be most useful in that scenario)
1. cardLast4Digits (Reason: always help in analyzing duplicate transactions as an identifier)

Duplicated Trnsactions = 100716

duplicate_trans = df_2[duplicated(df_2[,c('accountNumber', 'transactionAmount','merchantName', 'acqCountry' , 'accountOpenDate' ,'merchantCategoryCode','cardLast4Digits')]) | duplicated(df_2[,c('accountNumber', 'transactionAmount','merchantName', 'acqCountry' , 'accountOpenDate' ,'merchantCategoryCode','cardLast4Digits')], fromLast=TRUE), ]
head(duplicate_trans, n = 5)

dim(duplicate_trans)

## [1] 100716     24

Question:

Frauds in Duplicated Transaction is expected to be more.

The Top 3 Merhcants where duplicated transactions happened and Frauds took place are:

1. Fresh Flowers
1. Lyft
1. ebay.com

duplicate_trans_fraud = subset(duplicate_trans, isFraud == '1', select = merchantName)
dup_fraud = count(duplicate_trans_fraud, merchantName)
dup_fraud = filter(dup_fraud, n >= 30)
arrange(dup_fraud, -n)

###Question: ### Reversal Transactions can be found directly by using the attribute Transaction Type.

That comes out to be Reversal = 20303
Transaction Amount of Reversal Transaction = 2821792

nrow(subset(df_2, transactionType == 'REVERSAL'))

## [1] 20303

 sum(subset(df_2, transactionType == 'REVERSAL')$transactionAmount)

## [1] 2821792

###Question: ### Merchnats where Frauds took place according to the given attribute transaction type = Reversal in the original dataset:

Most Frauds Happening within Original data and reversal transaction type are at these places:

1. Lyft
1. walmart.com
1. gap.com

df_2_rev = subset(df_2, isFraud == '1', select = c(merchantName, transactionType))
df_2_rev = subset(df_2_rev, transactionType == 'REVERSAL', select = merchantName)

df_2_rev_fraud = count(df_2_rev, merchantName)
df_2_rev_fraud = filter(df_2_rev_fraud, n >= 10)
arrange(df_2_rev_fraud, -n)

###Question: ### 2nd method to find Reversal Transaction is to look directly into the duplicated transactions:

Reversal Transactions in Duplicated records = 17826
Transaction Amount for Reversal Transactions in Duplicated records = 2669860

nrow(subset(duplicate_trans, transactionType == 'REVERSAL'))

## [1] 17826

 sum(subset(duplicate_trans, transactionType == 'REVERSAL')$transactionAmount)

## [1] 2669860

Question:

Most Frauds that happen in Reversal Transactions of Duplicated Subset are for these Merchants:

1. Lyft
1. Fresh Flowers
1. Walmart.com

duplicate_trans_fraud = subset(duplicate_trans, isFraud == '1', select = c(merchantName, transactionType))
reverse_trans_fraud = subset(duplicate_trans_fraud, transactionType == 'REVERSAL', select = merchantName)

rev_fraud = count(reverse_trans_fraud, merchantName)
rev_fraud = filter(rev_fraud, n >= 10)
arrange(rev_fraud, -n)

library(ggparliament)
library(tidyverse)

## ── Attaching packages ─────────────────── tidyverse 1.3.1 ──

## ✔ tibble  3.1.7     ✔ purrr   0.3.4
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%()     masks ggplot2::%+%()
## ✖ psych::alpha()   masks ggplot2::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ lessR::recode()  masks dplyr::recode()
## ✖ lessR::rename()  masks dplyr::rename()

ru_semicircle <- parliament_data(election_data = rev_fraud,
                                 type = "semicircle", # Parliament type
                                 parl_rows = 5,      # Number of rows of the parliament
                                 party_seats = rev_fraud$n) # Seats per party

ggplot(ru_semicircle, aes(x = x, y = y, colour = merchantName)) +
  geom_parliament_seats() + 
  theme_ggparliament() +
  labs(title = "merchantName")

Removing Outliers

As we could see that data distribution was highly skewed in the numerical attributes, and it can drastically affect the model performance because fitting the line or plane would not be able to generalize well, so it is always useful to remove such outliers and have a better generalization and Test accuracy.

Question:

Method used to handle Outliers: Interquartile Ranges are generally used for removing the outliers in a numerical attribute

{
  df <- df[df$creditLimit > quantile(df$creditLimit, .25) - 1.5*IQR(df$creditLimit) & 
        df$creditLimit < quantile(df$creditLimit, .75) + 1.5*IQR(df$creditLimit), ]
  head(df, n = 5)
}

We can see which buckets are removed from the dataset. As for Models to generalize well on the dataset, it is advised to remove outliers. But in some use cases, we avoid that as well.

library(RColorBrewer)
coul <- brewer.pal(8, "Set2") 
e = count(df, creditLimit)
e <- e[complete.cases(e), ]
par(mar  = c(4, 4 ,2 ,0))
barplot(height= e$n, names= e$creditLimit, col=coul, xlab = 'creditLimit' , ylab = 'Frequency' )

Data Visualization after handling outliers for better generalization.

df <- df[df$availableMoney > quantile(df$availableMoney, .25) - 1.5*IQR(df$availableMoney) & 
        df$availableMoney < quantile(df$availableMoney, .75) + 1.5*IQR(df$availableMoney), ]
df <- df[df$transactionAmount > quantile(df$transactionAmount, .25) - 1.5*IQR(df$transactionAmount) & 
        df$transactionAmount < quantile(df$transactionAmount, .75) + 1.5*IQR(df$transactionAmount), ]
df <- df[df$currentBalance > quantile(df$currentBalance, .25) - 1.5*IQR(df$currentBalance) & 
        df$currentBalance < quantile(df$currentBalance, .75) + 1.5*IQR(df$currentBalance), ]
par(mfrow=c(2,2))
hist(df[ , c("creditLimit")], main="creditLimit", xlab="Frequency", ylab="credit Limit", col = "blue")
hist(df[ , c("availableMoney")], main="availableMoney", xlab="Frequency", ylab="availableMoney", col = "blue")
hist(df[ , c("transactionAmount")], main="transactionAmount", xlab="Frequency", ylab="transactionAmount", col = "blue")
hist(df[ , c("currentBalance")], main="currentBalance", xlab="Frequency", ylab="currentBalance", col = "blue")

Box Plots are again drawn to help visualize how outliers were impacting the overall skewness.

par(mfrow=c(2,2))
boxplot(df[ , c("creditLimit")], col = "green", main="credit Limit")
boxplot(df[ , c("availableMoney")], col = "red", main="availableMoney")
boxplot(df[ , c("transactionAmount")], col = "blue", main="transaction Amounty")
boxplot(df[ , c("currentBalance")], col = "purple", main="current Balance")

4. Feature Engineering

The reason for feature engineering is to make additional attributes from existing attributes that can be used effectively in the modeling and may enhance model predictions.

Making an additional Column derived from the equality of CardCVV and EnteredCVV as shown

df$CVVMatched <-  df$cardCVV == df$enteredCVV

Transormation from Categorical and Booleans to Numerical

df$CVVMatched <- as.integer(df$CVVMatched)

dim(df)

## [1] 617105     24

Making 4 set of values from time column: morning, afternoon/evening, night and late/mid night are also added as feature.

df_time <- data.frame(df)

head(df_time,n=5 )

library("lubridate")

df_time <- df_time %>%
  mutate(hour_admit = hour(strptime(transactionDateTime, format = "%Y-%m-%dT%H:%M:%S"))) %>%
  mutate(time_period = case_when(
    hour_admit > 05 & hour_admit < 11 ~ 0,
    hour_admit >= 11 & hour_admit < 17 ~ 1,
    hour_admit >= 17 & hour_admit < 23 ~ 2,
    hour_admit >=23 | hour_admit <= 5 ~ 3))

df$transactionTimeCat = df_time$time_period

head(df, n=5)

No of Transaction based on Four timing range

count(df, transactionTimeCat)

Count of CVV Matched and not mached

count(df, CVVMatched)

Here we are removing the attributes which are unique identifiers like Customer IDs, Account IDs etc.

As these attributes are not useful for learning patterns, these are just present for uniquely identifying the rows in the database.
Also, we have made additional feature columns from the Date Time so, the basic time attribute is not needed anymore, so removing that as well to make the model generalizable.
Other similar unique identifiers are removed for the same reason.

drops <- c('accountNumber','customerId', 'transactionDateTime', 'cardLast4Digits', 'cardCVV', 'enteredCVV', 'accountOpenDate', 'dateOfLastAddressChange', 'currentExpDate')
df <- df[ , !(names(df) %in% drops)]
print('Final total columns')

## [1] "Final total columns"

ncol(df)

## [1] 16

Categorical to Numerical Fitting through LabelEncoder

library(superml)

le <- LabelEncoder$new()

columns <- c('merchantName', 'acqCountry', 'merchantCountryCode', 'merchantCategoryCode', 'transactionType', 'posEntryMode', 'posConditionCode')

label <- LabelEncoder$new()
df$merchantName <- label$fit_transform(df$merchantName)

df$acqCountry <- label$fit_transform(df$acqCountry)

df$merchantCountryCode <- label$fit_transform(df$merchantCountryCode)

df$merchantCategoryCode <- label$fit_transform(df$merchantCategoryCode)

df$transactionType <- label$fit_transform(df$transactionType)

df$posEntryMode <- label$fit_transform(df$posEntryMode)

df$posConditionCode <- label$fit_transform(df$posConditionCode)

df = df %>% 
   mutate_all(~ifelse(is.na(.), mode(., na.rm = TRUE), .))

5. Handling Imbalance of Data Set

Question

As the Data is Imbalanced, so to run classification models efficiently it is advised to use a Down or Upper Sampling technique to balance the dataset for better results

#install.packages(c("zoo","xts","quantmod"))
#install.packages("smotefamily")

Using OverSampling Technique

The ROSE provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and allow for aiding both the phases of estimation and accuracy evaluation of a binary classifier in the presence of a rare class. Functions that implement more traditional remedies for the class imbalance and different metrics to evaluate accuracy are also provided. These are estimated by holdout, bootstrap, or cross-validation methods.

library(DMwR2) # for smote implementation
library(ROSE)# for ROSE sampling

# smote
set.seed(9560)
rose_train <- ROSE(isFraud ~ ., data  = df)$data

table(rose_train$isFraud)

## 
##      0      1 
## 309174 307931

Using UnderSampling Technique

Ovun creates possibly balanced samples by random over-sampling minority examples, under-sampling majority examples or combination of over- and under-sampling.

under_v1 <- ovun.sample(isFraud ~ .,
                        data = df,
                        method = "under",
                        N = 2 * sum(df$isFraud == 1))

df_under = under_v1$data

table(df_under$isFraud)

## 
##    0    1 
## 8847 8847

6. Modeling:

For Model Evaluation Always Train the model on Training dataset and Verify on test data sets.

Baseline: Logistic Regression

In logistic regression, we decide a probability threshold. If the probability of a particular element is higher than the probability threshold then we classify that element in one group or vice versa.
Logistic Regression model can be used as a baseline classifier and it is really a robust model to compare with. It is always advised to start with simpler models and then keep on adding complexity to have a clear trade-off between Variance and Bias.

# Libraries
library(pROC, quietly=TRUE)
library(microbenchmark, quietly=TRUE)

# Set seed so the train/test split is reproducible

# Loading package
library(caTools)
library(ROCR) 

ds_noads = df_under
   

#split <- sample.split(ds_noads, SplitRatio = 0.7)


#train <- subset(ds_noads, split == "TRUE")
#test <- subset(ds_noads, split == "FALSE")

#create a list of random number ranging from 1 to number of rows from actual data 

#and 70% of the data into training data  


data2 = sort(sample(nrow(ds_noads), nrow(ds_noads)*.8))


#creating training data set by selecting the output row values

train <- ds_noads[data2,]


#creating test data set by not selecting the output row values

test <- ds_noads[-data2,]

# Training model
logistic_model <- glm(isFraud ~ ., 
                      data = train, 
                      family = "binomial")

#logistic_model

# Summary
summary(logistic_model)

## 
## Call:
## glm(formula = isFraud ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0817  -1.0694  -0.5585   1.0963   2.0292  
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate   Std. Error z value
## (Intercept)               0.203188709  0.168595143   1.205
## creditLimit               0.000011916  0.000006803   1.752
## availableMoney           -0.000012060  0.000008381  -1.439
## transactionAmount         0.004758081  0.000169930  28.000
## merchantName             -0.000465575  0.000041221 -11.295
## acqCountry                0.043301686  0.228171315   0.190
## merchantCountryCode       0.084819309  0.233430307   0.363
## posEntryMode             -0.113553464  0.017284093  -6.570
## posConditionCode         -0.075115314  0.039741254  -1.890
## merchantCategoryCode      0.006528148  0.004391111   1.487
## transactionType          -0.027811227  0.050030075  -0.556
## currentBalance                     NA           NA      NA
## cardPresent              -0.420620035  0.041709599 -10.084
## expirationDateKeyInMatch  0.795534257  0.514831351   1.545
## CVVMatched               -0.433319209  0.159294140  -2.720
## transactionTimeCat       -0.014773995  0.015988819  -0.924
##                                      Pr(>|z|)    
## (Intercept)                           0.22813    
## creditLimit                           0.07984 .  
## availableMoney                        0.15017    
## transactionAmount        < 0.0000000000000002 ***
## merchantName             < 0.0000000000000002 ***
## acqCountry                            0.84948    
## merchantCountryCode                   0.71634    
## posEntryMode                  0.0000000000504 ***
## posConditionCode                      0.05874 .  
## merchantCategoryCode                  0.13710    
## transactionType                       0.57829    
## currentBalance                             NA    
## cardPresent              < 0.0000000000000002 ***
## expirationDateKeyInMatch              0.12229    
## CVVMatched                            0.00652 ** 
## transactionTimeCat                    0.35548    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19623  on 14154  degrees of freedom
## Residual deviance: 18139  on 14140  degrees of freedom
## AIC: 18169
## 
## Number of Fisher Scoring iterations: 4

Dimentions of Train data

dim(train)

## [1] 14155    16

Dimentions of Test data

dim(test)

## [1] 3539   16

predict_reg <- predict(logistic_model, 
                       test, type = "response")
#predict_reg  

# Changing probabilities
predict_reg <- ifelse(predict_reg >0.5, 1, 0)

# Evaluating model accuracy
# using confusion matrix
table(test$isFraud, predict_reg)

##    predict_reg
##        0    1
##   0 1209  547
##   1  680 1103

missing_classerr <- mean(predict_reg != test$isFraud)
print(paste('Accuracy =', 1 - missing_classerr))

## [1] "Accuracy = 0.65329189036451"

as.data.frame(table(ds_noads$isFraud))

ROCPred <- prediction(predict_reg, test$isFraud) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                      x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.6535584

# Plotting curve
plot(ROCPer)

plot(ROCPer, colorize = TRUE, 
     print.cutoffs.at = seq(0.1, by = 0.1), 
     main = "ROC CURVE LOGISTIC REGRESSION")
abline(a = 0, b = 1)

auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

library(e1071)

library(caret)

confusionMatrix(factor(predict_reg), factor(test$isFraud))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1209  680
##          1  547 1103
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6373, 0.669)      
##     No Information Rate : 0.5038               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3069               
##                                                
##  Mcnemar's Test P-Value : 0.0001643            
##                                                
##             Sensitivity : 0.6885               
##             Specificity : 0.6186               
##          Pos Pred Value : 0.6400               
##          Neg Pred Value : 0.6685               
##              Prevalence : 0.4962               
##          Detection Rate : 0.3416               
##    Detection Prevalence : 0.5338               
##       Balanced Accuracy : 0.6536               
##                                                
##        'Positive' Class : 0                    
##

confusionMatrix(factor(predict_reg), factor(test$isFraud), mode = "everything", positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1209  680
##          1  547 1103
##                                                
##                Accuracy : 0.6533               
##                  95% CI : (0.6373, 0.669)      
##     No Information Rate : 0.5038               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3069               
##                                                
##  Mcnemar's Test P-Value : 0.0001643            
##                                                
##             Sensitivity : 0.6186               
##             Specificity : 0.6885               
##          Pos Pred Value : 0.6685               
##          Neg Pred Value : 0.6400               
##               Precision : 0.6685               
##                  Recall : 0.6186               
##                      F1 : 0.6426               
##              Prevalence : 0.5038               
##          Detection Rate : 0.3117               
##    Detection Prevalence : 0.4662               
##       Balanced Accuracy : 0.6536               
##                                                
##        'Positive' Class : 1                    
##

The Results of Logisitc Regression Classifier:

Accuracy = 0.6535 = 65.35%

Here, it can be seen that the logistic regression model tried to draw the boundary between the fraud and non-fraud data set and could perform with 65% of accuracy on the given attributes. Here, the problem is inherently challenging as the real-world transaction data-set may have 100 and 1000s of predictive attributes (and it is not a synthetic data set, so it is an expected result on a real-day dataset), and on the given attributes 65% is still a significant number with a great precision and recall ratio.

Alternative Basline: K-Nearest Neighbor

KNN tries to predict the correct class for the test data by calculating the distance between the test data and all the specified training points. Then select the K number of points which is closet to the test data. The KNN algorithm calculates the probability of the test data belonging to the classes of ‘K’ training data and class holds the highest probability will be selected.

The hyperparameter is tuned between 5-9 of K values, which is bes suited for our use-case.

fitControl <- trainControl(method="cv",
                            number = 5,
                            preProcOptions = list(thresh = 0.99), # threshold for pca preprocess
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)

train$isFraud <- factor(train$isFraud)

model_knn <- train(make.names(isFraud)~.,
                   train,
                   method="knn",
                   metric="ROC",
                   preProcess = c('center', 'scale'),
                   tuneLength=10,
                   trControl=fitControl)

pred_knn <- predict(model_knn, test)

ROCPred <- prediction(as.numeric(pred_knn), test$isFraud) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                      x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.6590938

# Plotting curve
plot(ROCPer)

plot(ROCPer, colorize = TRUE, 
     print.cutoffs.at = seq(0.1, by = 0.1), 
     main = "ROC CURVE KNN")
abline(a = 0, b = 1)

auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

The Results of K nearest neighbour:

Accuracy = 0.6590 = 65.90%

The result is very much comparable to the Logistic regression technique. So, it can clearly be seen that we need to add more variance in the model, which we are going to do through Tree-based models. The simple reason is that to still have the interpretability from a business point of view that which features carry more predictiveness and help in decision making.

Tree-Based Classifiers for better Accuracy and Interpretability

Decision Tree to identify important predictive features to enhance the variance of the model.

A decision tree is drawn upside down with its root at the top. It starts with a condition/internal node, based on which the tree splits into branches/ edges. The end of the branch that doesn’t split anymore is the decision/leaf, in this case, whether the fraud happens or not, respectively. Inherently it can be seen that it explains which features are important and can help in decision-making with its hierarchical structure.

The decision tree provides us with the interpretability regarding the decision made over the feature set, which may lead to good accuracy and shows a clear hierarchy

raw.data = df_under


nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset =   raw.data[-indexT,]

train = trainset
test = verset

library(rpart) # for regression trees
library(randomForest) # for random forests


# train a decision tree based on our dataset 
tree.model <- rpart(isFraud ~ ., data = train)

Decison Tree developed over dataset

# plot our regression tree 
plot(tree.model, uniform=TRUE)
# add text labels & make them 60% as big as they are by default
text(tree.model, cex=.6)

The decision tree clearly shows how important the transaction amount is, then the predictive power of posEntryMode and later on, if the card is present or not, has a high impact on the response variable “isFraud”.

Ensemble: Random Forest for Feature Importance as it is more robust than a decision tree, and does not easily overfit.

Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

It provides the feature importance with the help of Gini index and Information Gain, which helps in determining which feature influence more on the fradulent transactions.

raw.data = df_under


nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset =   raw.data[-indexT,]
n <- names(trainset)
rf.form <- as.formula(paste("isFraud ~", paste(n[!n %in% "isFraud"], collapse = " + ")))
trainset.rf <- randomForest(rf.form,trainset,ntree=100,importance=T)

Plots showing feature importance

varimp <- data.frame(trainset.rf$importance)
  vi1 <- ggplot(varimp, aes(x=reorder(rownames(varimp),IncNodePurity), y=IncNodePurity)) +
  geom_bar(stat="identity", fill="tomato", colour="black") +
  coord_flip() + theme_bw(base_size = 8) +
  labs(title="Prediction using RandomForest with 100 trees", subtitle="Variable importance (IncNodePurity)", x="Variable", y="Variable importance (IncNodePurity)")

  vi2 <- ggplot(varimp, aes(x=reorder(rownames(varimp),X.IncMSE), y=X.IncMSE)) +
  geom_bar(stat="identity", fill="lightblue", colour="black") +
  coord_flip() + theme_bw(base_size = 8) +
  labs(title="Prediction using RandomForest with 100 trees", subtitle="Variable importance (%IncMSE)", x="Variable", y="Variable importance (%IncMSE)")

  
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
    
grid.arrange(vi1, vi2, ncol=2)

Random Forest clearly identifies important features with high impact, the Top 5 are:
Transaction Amount
Merchant Name
Current Balance
Available Money
POS Entry Mode and others will go inside the final model.

One of the most Powerful one: XGBoost (Ensembling Method)

XGBoost is a gradient boosting on ‘steroids’. It is a perfect combination of software and hardware optimization techniques to yield superior results using less computing resources in the shortest amount of time. The XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters.

library(xgboost, quietly=TRUE)

raw.data = df_under


nrows <- nrow(raw.data)
set.seed(314)
indexT <- sample(1:nrow(raw.data), 0.8 * nrows)
#separate train and validation set
trainset = raw.data[indexT,]
verset =   raw.data[-indexT,]

train = trainset
test = verset


xgb.data.train <- xgb.DMatrix(as.matrix(train[, colnames(train) != "isFraud"]), label = train$isFraud)
xgb.data.test <- xgb.DMatrix(as.matrix(test[, colnames(test) != "isFraud"]), label = test$isFraud)

# Get the time to train the xgboost model
xgb.bench.speed = microbenchmark(
    xgb.model.speed <- xgb.train(data = xgb.data.train
        , params = list(objective = "binary:logistic"
            , eta = 0.1
            , max.depth = 3
            , min_child_weight = 100
            , subsample = 1
            , colsample_bytree = 1
            , nthread = 3
            , eval_metric = "auc"
            )
        , watchlist = list(test = xgb.data.test)
        , nrounds = 500
        , early_stopping_rounds = 40
        , print_every_n = 20
        )
    , times = 5L
)

## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818 
## [1]  test-auc:0.693403 
## Will train until test_auc hasn't improved in 40 rounds.
## 
## [21] test-auc:0.717561 
## [41] test-auc:0.725942 
## [61] test-auc:0.729602 
## [81] test-auc:0.733082 
## [101]    test-auc:0.735312 
## [121]    test-auc:0.736935 
## [141]    test-auc:0.738546 
## [161]    test-auc:0.739718 
## [181]    test-auc:0.740832 
## [201]    test-auc:0.741978 
## [221]    test-auc:0.742289 
## [241]    test-auc:0.743184 
## [261]    test-auc:0.743543 
## [281]    test-auc:0.744220 
## [301]    test-auc:0.744372 
## [321]    test-auc:0.745138 
## [341]    test-auc:0.745831 
## [361]    test-auc:0.745897 
## [381]    test-auc:0.746428 
## [401]    test-auc:0.747213 
## [421]    test-auc:0.747397 
## [441]    test-auc:0.747453 
## [461]    test-auc:0.747624 
## [481]    test-auc:0.747817 
## [500]    test-auc:0.747818

print(xgb.bench.speed)

## Unit: seconds
##                                                                                                                                                                                                                                                                                                                                                   expr
##  xgb.model.speed <- xgb.train(data = xgb.data.train, params = list(objective = "binary:logistic",      eta = 0.1, max.depth = 3, min_child_weight = 100, subsample = 1,      colsample_bytree = 1, nthread = 3, eval_metric = "auc"),      watchlist = list(test = xgb.data.test), nrounds = 500, early_stopping_rounds = 40,      print_every_n = 20)
##       min       lq     mean   median       uq     max neval
##  2.623388 2.633211 2.704979 2.735801 2.765213 2.76728     5

print(xgb.model.speed$bestScore)

## NULL

# Make predictions on test set for ROC curve
xgb.test.speed = predict(xgb.model.speed
                   , newdata = as.matrix(test[, colnames(test) != "isFraud"])
                   , ntreelimit = xgb.model.speed$bestInd)
#auc.xgb.speed = roc(test$Class, xgb.test.speed, plot = TRUE, col = "blue")
#print(auc.xgb.speed)

ROCPred <- prediction(xgb.test.speed, test$isFraud) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                      x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.7478539

# Plotting curve
plot(ROCPer)

plot(ROCPer, colorize = TRUE, 
     print.cutoffs.at = seq(0.1, by = 0.1), 
     main = "ROC CURVE XGBOOST")
abline(a = 0, b = 1)

auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

We can see that XGBoost Classifier gave way better results than the baselines models:

We did it! The accuracy jumped from 65% to 75% with a clear balance between Precision and Recalls. The reason this accuracy is great for our given use case is that the inherent problem is hard and the attributes in the data set are not any synthetic data prepared for a competition. As in the real-world scenarios, lots of domain knowledge is incorporated and then the model is put into the production and is continuously monitored. But what it clearly says is, that it is nearly impossible to predict frauds in the current existing algorithms to be 90%. But if we are able to predict up to 75%, that means we can highlight the transactions and can clearly take relevant actions immediately to mitigate the fraud rates, which will inherently keep on enhancing our labeled data and maturing the model which may lead to 80% in accuracy, but the 90% accuracy is still a dream in the fraud system world.

It can clearly be seen that XGBoost Trees outperformed all the other classifiers

Unsupervised Technique: Principle Component Analysis

There’s a few pretty good reasons to use PCA that how one would plot multi-dimensional data by using PCA into 2 dimensions, we actually capture 55% variance with 4 PCAs and 27% with 2 PCAs (Dim1 13.9% + Dim2 13.1%). It is pretty good when taking into consideration that the original data consisted of 29 features which would be impossible to plot in any meaningful way.

df_under.pca <- prcomp(df_under[,c(1:15)], center = TRUE,scale. = TRUE)
summary(df_under.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4445 1.4019 1.2840 1.09533 1.05290
## Proportion of Variance 0.1391 0.1310 0.1099 0.07998 0.07391
## Cumulative Proportion  0.1391 0.2701 0.3800 0.46001 0.53392
##                            PC6     PC7     PC8     PC9
## Standard deviation     1.00436 1.00003 0.99841 0.98256
## Proportion of Variance 0.06725 0.06667 0.06646 0.06436
## Cumulative Proportion  0.60116 0.66783 0.73429 0.79865
##                           PC10    PC11    PC12    PC13
## Standard deviation     0.97034 0.95195 0.84932 0.64800
## Proportion of Variance 0.06277 0.06041 0.04809 0.02799
## Cumulative Proportion  0.86142 0.92184 0.96993 0.99792
##                           PC14               PC15
## Standard deviation     0.17661 0.0000000000001122
## Proportion of Variance 0.00208 0.0000000000000000
## Cumulative Proportion  1.00000 1.0000000000000000

library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)

ggbiplot(df_under.pca, labels=rownames(df_under))

The sole purpose here for this algorithm is to analyze if let’s say in the future we do not have the true ground labels for some entries then how the existing features can still explain the variance in the data set. And it can be seen that the first four components are able to explain 55% of the variance in the dataset. That means with the continuous effort the data acquisition could be done to enhance the variance of features with respect to the response variable, by incorporating business knowledge.

7. Results, Analysis, and Discussion

We carried out multiple modules of data science pipeline which involved: Data Visualizations, Descriptive Statistics, Prescriptive Statistics, Data wrangling, Feature engineering, Predictive Modelling, Interpretation of the model, comparing results and creating impactful solution for our use-case which can be used for decision making.
Classification is the best choice whenever we have a given class label to model around.
We started by analyzing the data set and could clearly see that:
Most Credit Cards have 5000 as their limit amount.
There are 3 Types of Transaction types: Purchase, Reversal, and Address Verification, and most values lie among the Purchase types of transaction
We also saw and verified through feature importance of Random Forest and Decision Trees that if the card is Present during the transaction or not, plays an important role in predicting frauds.
The list of top merchants are: Uber, Lyft, oldnavy.com, staples.com, alibaba.com, apple.com, walmart.com, cheapfast.com, ebay.com, target.com
The top type of transactions seen are in these areas: Online Retail, Food, Entertainment, Online Gifts, Ride Share, Hotels
Available Money mostly lies under the buckets of less than 10,000. It is Right Skewed, which actually makes sense in the distribution of amount figures.
The Merchant Names where most Fraud happens are: Lyft, ebay.com, Fresh Flowers, Uber, walmart.com
The Customers and account IDs that get the most frauds are: 380680241, 782081187, 246251253
The Top 3 Merchants where duplicated transactions happened and Frauds took place are: Fresh Flowers, Lyft, ebay.com
The additional features added through feature engineering were very predictive, especially the time zones of the day.
The under-sampling technique was chosen over the over-sampling technique because it produced better predictive models as it is hard to over-sample 1.6% of the class to match the 98.4% of the regular class. So, under-sampled performed better in the fraud cases.
It can be seen that the logistic regression model tried to draw the boundary between the fraud and non-fraud data set and could perform with 65% of accuracy on the given attributes. Here, the problem is inherently challenging as the real-world transaction data-set may have 100 and 1000s of predictive attributes (and it is not a synthetic data set, so it is an expected result on a real-day dataset), and on the given attributes 65% is still a significant number with a great precision and recall ratio.
The result of KNN is very much comparable to the Logistic regression technique. So, it can clearly be seen that we need to add more variance in the model, which we are going to do through Tree-based models. The simple reason is that to still have the interpretability from a business point of view that which features carry more predictiveness and help in decision making.
Logistic Regression and K-Nearest Neighbor produced comparable accuracy results which were 65% and further on these baselines were enhanced by adding more variance in the models.

Balanced Accuracy of each Model

Logisitc_Regression <- c("65.3%")
KNN <- c("65.9%")
XGBoost <- c("75%")
results <- data.frame(XGBoost, KNN, Logisitc_Regression)
results

Then we tried out tree-based classifiers for better accuracy, variance, and interpretability.
The decision tree provides us with the interpretability regarding the decision made over the feature set, which may lead to good accuracy and shows a clear hierarchy.
The decision tree clearly shows how important the transaction amount is, then the predictive power of posEntryMode and later on, if the card is present or not, has a high impact on the response variable “isFraud”.
Ensembled and stacking-based algorithms are used such as Random Forest for evaluating Feature Importance as it is more robust than a decision tree, and does not easily overfit.
Random Forest clearly identifies important features with high impact, the Top 5 are Transaction Amount, Merchant Name, Current Balance, Available Money, and POS Entry Mode.
In the last we used XGBoost. The accuracy jumped from 65% to 75% with a clear balance between Precision and Recalls. The reason this accuracy is great for our given use case is that the inherent problem is hard and the attributes in the data set are not any synthetic data prepared for a competition. As in the real-world scenarios, lots of domain knowledge is incorporated and then the model is put into the production and is continuously monitored. But what it clearly says is, that it is nearly impossible to predict frauds in the current existing algorithms to be 90%. But if we are able to predict up to 75%, that means we can highlight the transactions and can clearly take relevant actions immediately to mitigate the fraud rates, which will inherently keep on enhancing our labeled data and maturing the model which may lead to 80% in accuracy, but the 90% accuracy is still a dream in the fraud system world. It can clearly be seen that XGBoost Trees outperformed all the other classifiers.
We checked on various Metrics such as Precision, Recall, Accuracy, F1 Scores, and Balanced Accuracy. Our Best Model’s test performance is on XGBOOST Classifier which gives 75% accuracy.
Then for unsupervised analysis, we used Principal Component Analysis. The sole purpose of this algorithm is to analyze if let’s say in the future we do not have the true ground labels for some entries then how the existing features can still explain the variance in the data set. And it can be seen that the first four components are able to explain 55% of the variance in the dataset. That means with the continuous effort the data acquisition could be done to enhance the variance of features with respect to the response variable, by incorporating business knowledge.
Neural Networks are not used for simplicity and to not get into the black-boxed domain as interpretations of the results become harder. Else, a multi-layer perceptron could also be trained, would not recommend CNN or RNN for this use case. Though some people use RNNs for a contextual understanding of the fraud behavior.
In the data set we had clearly seen that there were some Customer IDs that were highly fraudulent. So, a user-profiling-based statistical model would significantly improve the results.
Important data acquisition ( either by the client, business teams, or an external API) and rigorous feature engineering.
Better way to fill the missing values in columns, based on the distribution of the data column
Better way to handle outliers. In the case of a large number of records, people use Auto-encoders as well.
Exploring neural networks side for modeling and particularly Recurrent Neural Networks to see if there can be built context between next and previous fraudulent transactions.
Checking for data quality in terms of days like Cyber-Monday or Thanksgiving or any sales.
Time series-based anomaly detection in the dataset and then modeling it appropriately

8. Impact:

The Power of the data analysis done here answers a bunch of questions in the field of fraud analysis. It does not straightly go into the modeling to explain which transaction is fraudulent or not but comprehensively explains what are the type of merchants we have in the use case, which type of merchants are experiencing more frauds, which region and customers are the victims of most frauds, which industry should be mainly focussed or prioritized from a business point of view. How much money we are losing to fraudulent transactions. What impact do the reverse, multi-swiped, and duplicate transactions have on the overall analysis? All of these solutions can help any upper management to straight away take revolutionizing decisions for their businesses.
Not only this, the power of our machine learning models compliments the data analysis and findings. As simple baseline models are truly easily interpretable by non-scientific stakeholders. Also, the choice of using tree-based models was to enhance the transparency in the business terms so that the stakeholders can clearly analyze which attributes and departments are affecting more in a hierarchical structure in terms of fraud cases so they can fix them right away and allocate more security and monitoring teams in that area. Priority assignment has a major impact on the businesses and our analysis and models give the liberty to end-users to easily take decisions in a hierarchical manner to mitigate fraud.
And what happens when they can easily make strategies? The online and digital world become free of muggers and thieves and can significantly improve POS and e-commerce platforms like Amazon, and Walmart and can help banks grow better in revenue as they lose billions of dollars through fraudulent transactions.

======================================================================================================================================

Data Analysis of Digital Payments and Fraud Detection

Team: TAYYAB BIN TAHIR & HARSHA SIDDAGANGAIAH

1. About Data

Structure of Data:

2. Questions

3. Data Visualization and Answers to Questions:

This and follwing sections will try to answer the questions mentioned above with various graphs that may include: Plots, Histograms, Box Plots, Correlation Matrix and Hypothesis about the Structure of Data.

Buckets/Unique Values in Credit Card Limit Column are shown below:

Question:

4 Types of Merchant Country Code exists in the dataset, where US is a dominant entity as shown.

5 POS Entery Mode exists where “05”, “09”, and “02” covers most of the values.

3 POS Condition Mode exists where “01” is dominant among all.

Question:

There are 3 Types of Transaction types:

Question:

List of top merchants

Question:

Top Transaction Types, in terms of their name, to understand the behavior of data.

Question:

Let’s explore Numerical Valued Attributes in Credit Card Transactions these attributes hold high importance because they directly impact the response variable isFraud.

Important Note:

Question:

Box Plots help analyzing the data distribution and beautifully caters Outliers in the datasets and the skewness.

Most Fradulant Merchant Types

Question:

Top 3 Merchant Categories are:

Question:

The Merchant Names where most Fraud happens:

Top 5 are:

Question:

The Customers and account IDs which get most frauds can also be analyzed through dataset.

Top 3 Account IDs are:

Important Note: Account IDs and Customer IDs are the same.

Correlation Matrix between Numerical Attributes

4. Data Wrangling and further analysis

Reason for removing these 6 columns:

Question:

Analyzing Duplicate, Reveresal and Multi-Swiped Transactions

Adding Date Column in the Data Frame to compute Results independent of just Date-Time

Duplicate transaction

Question:

The attributes used are:

Duplicated Trnsactions = 100716

Question:

The Top 3 Merhcants where duplicated transactions happened and Frauds took place are:

Most Frauds Happening within Original data and reversal transaction type are at these places:

Question:

Most Frauds that happen in Reversal Transactions of Duplicated Subset are for these Merchants:

Removing Outliers

Question:

Method used to handle Outliers: Interquartile Ranges are generally used for removing the outliers in a numerical attribute

Data Visualization after handling outliers for better generalization.

Box Plots are again drawn to help visualize how outliers were impacting the overall skewness.

4. Feature Engineering

Making an additional Column derived from the equality of CardCVV and EnteredCVV as shown

Transormation from Categorical and Booleans to Numerical

Making 4 set of values from time column: morning, afternoon/evening, night and late/mid night are also added as feature.

No of Transaction based on Four timing range

Count of CVV Matched and not mached

Here we are removing the attributes which are unique identifiers like Customer IDs, Account IDs etc.

Categorical to Numerical Fitting through LabelEncoder

5. Handling Imbalance of Data Set

Question

As the Data is Imbalanced, so to run classification models efficiently it is advised to use a Down or Upper Sampling technique to balance the dataset for better results

Using OverSampling Technique

Using UnderSampling Technique

6. Modeling:

For Model Evaluation Always Train the model on Training dataset and Verify on test data sets.

Baseline: Logistic Regression

Dimentions of Train data

Dimentions of Test data

The Results of Logisitc Regression Classifier:

Accuracy = 0.6535 = 65.35%

Alternative Basline: K-Nearest Neighbor

The hyperparameter is tuned between 5-9 of K values, which is bes suited for our use-case.

The Results of K nearest neighbour:

Accuracy = 0.6590 = 65.90%

Tree-Based Classifiers for better Accuracy and Interpretability

Decision Tree to identify important predictive features to enhance the variance of the model.

The decision tree provides us with the interpretability regarding the decision made over the feature set, which may lead to good accuracy and shows a clear hierarchy