This synthetically generated dataset consists of payments made by various customers across different time periods and for different amounts. For more information, see the Kaggle page for this dataset:

https://www.kaggle.com/code/turkayavci/fraud-detection-on-bank-payments/notebook

Data

As we can see in the first rows below, the dataset has 9 feature columns and a target column. The feature columns are:

Step: This feature represents the day from the start of the simulation. There are 180 steps, so the simulation ran for roughly 6 months.
Customer: This feature represents the customer id.
zipCodeOrigin: The zip code of origin/source.
Merchant: The merchant’s id.
zipMerchant: The merchant’s zip code.

Age: Categorized age
0: <= 18,
1: 19-25,
2: 26-35,
3: 36-45,
4: 46-55,
5: 56-65,
6: > 65
U: Unknown

Gender: Gender of the customer
E: Enterprise,
F: Female,
M: Male,
U: Unknown
Category: Category of the purchase. I won’t write all categories here, we’ll see them later in the analysis.
Amount: Amount of the purchase.
Fraud: Target variable which shows whether the transaction is fraudulent (1) or benign (0).

library(dplyr)
library(tidyr)
library(ggplot2)   # needed for the ggplot() calls below
library(corrplot)
library(caret)
library(rsample)
#library(rpart)  # for decision tree
#library(rpart.plot)
#library(rattle) # for plotting decision tree
#library(e1071)  # for Naive Bayes classifier
library(naivebayes)
#library(randomForest)
library(pROC)

1. Import Data & Preprocessing

transactions = read.csv(paste0(getwd(),"/bs140513_032310.csv"))
head(transactions)
##   step      customer age gender zipcodeOri      merchant zipMerchant
## 1    0 'C1093826151' '4'    'M'    '28007'  'M348934600'     '28007'
## 2    0  'C352968107' '2'    'M'    '28007'  'M348934600'     '28007'
## 3    0 'C2054744914' '4'    'F'    '28007' 'M1823072687'     '28007'
## 4    0 'C1760612790' '3'    'M'    '28007'  'M348934600'     '28007'
## 5    0  'C757503768' '5'    'M'    '28007'  'M348934600'     '28007'
## 6    0 'C1315400589' '3'    'F'    '28007'  'M348934600'     '28007'
##              category amount fraud
## 1 'es_transportation'   4.55     0
## 2 'es_transportation'  39.68     0
## 3 'es_transportation'  26.89     0
## 4 'es_transportation'  17.25     0
## 5 'es_transportation'  35.72     0
## 6 'es_transportation'  25.81     0

Now let's look at the structure of the imported data.

str(transactions)
## 'data.frame':    594643 obs. of  10 variables:
##  $ step       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer   : chr  "'C1093826151'" "'C352968107'" "'C2054744914'" "'C1760612790'" ...
##  $ age        : chr  "'4'" "'2'" "'4'" "'3'" ...
##  $ gender     : chr  "'M'" "'M'" "'F'" "'M'" ...
##  $ zipcodeOri : chr  "'28007'" "'28007'" "'28007'" "'28007'" ...
##  $ merchant   : chr  "'M348934600'" "'M348934600'" "'M1823072687'" "'M348934600'" ...
##  $ zipMerchant: chr  "'28007'" "'28007'" "'28007'" "'28007'" ...
##  $ category   : chr  "'es_transportation'" "'es_transportation'" "'es_transportation'" "'es_transportation'" ...
##  $ amount     : num  4.55 39.68 26.89 17.25 35.72 ...
##  $ fraud      : int  0 0 0 0 0 0 0 0 0 0 ...

We notice that the character fields are wrapped in single quotes ('). Let's remove the ' character from each entry in the data.

# identify the character columns
charcols = sapply(transactions, class) == "character"
print(colnames(transactions[charcols]))
## [1] "customer"    "age"         "gender"      "zipcodeOri"  "merchant"   
## [6] "zipMerchant" "category"
# strip the surrounding single quotes from every character column
transactions[charcols] <- lapply(transactions[charcols], gsub, pattern = "'", replacement = '')
head(transactions)
##   step    customer age gender zipcodeOri    merchant zipMerchant
## 1    0 C1093826151   4      M      28007  M348934600       28007
## 2    0  C352968107   2      M      28007  M348934600       28007
## 3    0 C2054744914   4      F      28007 M1823072687       28007
## 4    0 C1760612790   3      M      28007  M348934600       28007
## 5    0  C757503768   5      M      28007  M348934600       28007
## 6    0 C1315400589   3      F      28007  M348934600       28007
##            category amount fraud
## 1 es_transportation   4.55     0
## 2 es_transportation  39.68     0
## 3 es_transportation  26.89     0
## 4 es_transportation  17.25     0
## 5 es_transportation  35.72     0
## 6 es_transportation  25.81     0

Verify the structure of the data…

str(transactions)
## 'data.frame':    594643 obs. of  10 variables:
##  $ step       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer   : chr  "C1093826151" "C352968107" "C2054744914" "C1760612790" ...
##  $ age        : chr  "4" "2" "4" "3" ...
##  $ gender     : chr  "M" "M" "F" "M" ...
##  $ zipcodeOri : chr  "28007" "28007" "28007" "28007" ...
##  $ merchant   : chr  "M348934600" "M348934600" "M1823072687" "M348934600" ...
##  $ zipMerchant: chr  "28007" "28007" "28007" "28007" ...
##  $ category   : chr  "es_transportation" "es_transportation" "es_transportation" "es_transportation" ...
##  $ amount     : num  4.55 39.68 26.89 17.25 35.72 ...
##  $ fraud      : int  0 0 0 0 0 0 0 0 0 0 ...

The values of zipcodeOri and zipMerchant are the same throughout the data, so we might as well drop them.

Unique values of zipcodeOri

unique(transactions$zipcodeOri)
## [1] "28007"

Unique values of zipMerchant

unique(transactions$zipMerchant)
## [1] "28007"

Since there is only one value for each of zipMerchant and zipcodeOri, we can remove these fields.

transactions = transactions %>% dplyr::select(-c(zipMerchant, zipcodeOri))

Now let's check the data for missing (NA) values in any attribute.

colSums(is.na(transactions))
##     step customer      age   gender merchant category   amount    fraud 
##        0        0        0        0        0        0        0        0

No missing observations are present in the data. Before moving to the next step, let's look at the size of the data.

dim(transactions)
## [1] 594643      8

2. Univariate Analysis

In this section, we analyse each variable individually to gain insight from it. We explore the distribution properties of the variables as well as their relationship with the fraud rate.

Fraud

‘fraud’ is a binary indicator denoting whether a transaction is fraudulent. We expect an overwhelming majority of the transactions to be non-fraudulent.

print("summary statistics for variable Fraud:")
## [1] "summary statistics for variable Fraud:"
cat("\n")
summary(as.factor(transactions$fraud))
##      0      1 
## 587443   7200
transtype = table(transactions$fraud) %>% as.data.frame() %>% rename("fraud"="Var1")

ggplot(transtype, aes(x = fraud, y = Freq, fill = fraud)) +
  geom_bar(stat = "identity", color = "black") +
  labs(x = "fraud", y = "Count", title = "Bar Chart of Distribution by fraud") +
  theme_minimal()

As the plot and the counts above show, the fraud data is heavily imbalanced: only about 1.2% of transactions are fraudulent. To balance the dataset, one could apply an oversampling or undersampling technique, although no such method has been adopted for this modelling exercise.
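For illustration, a minimal sketch of how the classes could be balanced by undersampling with caret's downSample() (hypothetical; not applied in this analysis):

# Hypothetical sketch (not applied here): undersample the majority class so
# both classes end up with equal row counts
balanced <- caret::downSample(x = transactions %>% dplyr::select(-fraud),
                              y = factor(transactions$fraud),
                              yname = "fraud")
table(balanced$fraud)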

gender

Unique values of gender are

unique(transactions$gender)
## [1] "M" "F" "E" "U"

Let's look at the distribution across the levels of gender, including the fraud rate. This helps us understand whether fraudsters target a specific gender.

gender = group_by(transactions, gender) %>%
  summarise(Freq = n(), fraudcount = sum(fraud), fraudrate = sum(fraud)/n()) %>%
  as.data.frame() %>%
  mutate(Perc = Freq/sum(Freq))
print("Distribution for variable gender:")
## [1] "Distribution for variable gender:"
cat("\n")
gender
##   gender   Freq fraudcount   fraudrate         Perc
## 1      E   1178          7 0.005942275 0.0019810205
## 2      F 324565       4758 0.014659621 0.5458148839
## 3      M 268385       2435 0.009072787 0.4513380297
## 4      U    515          0 0.000000000 0.0008660659

From the data, the number of fraud cases among females is roughly twice that among males; however, females also performed more transactions overall. Even looking at the fraud rate (count of frauds over total count of transactions), a female customer is still more likely to be the victim of a fraudulent transaction than a male customer.

Let's visualize this using a bar chart, with the fraud rate overlaid as a line on a secondary axis.

ggplot(gender, aes(x = gender, y = Perc)) +
  geom_bar(stat = "identity", col = "black", aes(fill = gender)) +
  geom_line(aes(x = gender, y = 40*fraudrate, group = 1), color = "red", stat = "identity") +
  labs(x = "gender", y = "Percentage", title = "Distribution/fraud rate by gender") +
  scale_y_continuous(sec.axis = sec_axis(~./40*100, name = "%Frauds"))

In conclusion, the majority of transactions are made by males and females, with females showing a higher likelihood of fraud than males. Enterprise transactions have the third-highest prevalence of fraud among the gender levels.

age

First, let's look at the unique values of age.

unique(transactions$age)
## [1] "4" "2" "3" "5" "1" "6" "U" "0"

Next we look at the distribution of the variable age. Age is a categorical variable, so the univariate analysis involves looking at the distribution of the distinct values and the trend of the fraud percentage among them.

age = group_by(transactions, age) %>%
  summarise(Freq = n(), fraudcount = sum(fraud), fraudrate = sum(fraud)/n()) %>%
  as.data.frame() %>%
  mutate(Perc = Freq/sum(Freq))
print("summary statistics for variable age:")
## [1] "summary statistics for variable age:"
cat("\n")
age
##   age   Freq fraudcount   fraudrate        Perc
## 1   0   2452         48 0.019575856 0.004123482
## 2   1  58131        689 0.011852540 0.097757814
## 3   2 187310       2344 0.012514014 0.314995720
## 4   3 147131       1755 0.011928146 0.247427448
## 5   4 109025       1410 0.012932814 0.183345301
## 6   5  62642        686 0.010951119 0.105343879
## 7   6  26774        261 0.009748263 0.045025335
## 8   U   1178          7 0.005942275 0.001981021

Let's visualise this table to understand it further.

ggplot(age) +
  geom_bar(aes(x = age, y = Perc), stat = "identity", col = "black", fill = 'skyblue') +
  geom_line(aes(x = age, y = 10*fraudrate, group = 1), color = "red", stat = "identity") +
  labs(x = "age", y = "Percentage", title = "Distribution/fraud rate by age") +
  scale_y_continuous(sec.axis = sec_axis(~./10, name = "%Frauds"))

The majority of transactions are made by people in age groups 2, 3 and 4, which represent ages 26-55. Age group 0 has the highest fraud rate at around 2%. The fraud percentage is almost constant across groups 1 to 4, then falls slightly in groups 5 and 6. This could mean that older people are less likely to be victims of fraud, whereas very young people (<= 18) are highly likely to be.

amount

Before performing univariate analysis, we check for transactions with an amount of zero, as these could be potential data issues.

sum(transactions$amount==0)
## [1] 52

There are 52 such records, so we remove these observations as well.

transactions = transactions %>% dplyr::filter(amount>0)

Now let's look at the summary statistics of the amount variable.

print("summary statistics for variable amount:")
## [1] "summary statistics for variable amount:"
cat("\n")
summary(transactions$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01   13.74   26.90   37.89   42.54 8329.96

The summary statistics indicate that the distribution of amount could be right skewed, as the mean is greater than the median. Now let's visualize the distribution of the variable using a histogram.
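The plotting chunk is omitted above; a minimal sketch that would produce such a histogram (the 500 cap on the x-axis is an assumption, used only to right censor the plot):

# Sketch of the histogram of amount; the xlim cap of 500 is an assumption
ggplot(transactions, aes(x = amount)) +
  geom_histogram(bins = 50, fill = 'skyblue', col = 'black') +
  xlim(0, 500) +
  labs(title = "Histogram of amount (right censored)", x = "amount", y = "count")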

From the plot, it can be observed that amount is heavily right skewed. Note that the plot is right censored to account for extremely large values.

Log transformation of amount

To get a better view of the data, we can study the distribution of log(amount), as the logarithmic transformation makes right-skewed data more symmetric.
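The original chunk is again not shown; a minimal sketch in the same style:

# Sketch of the corresponding histogram of log(amount)
ggplot(transactions, aes(x = log(amount))) +
  geom_histogram(bins = 50, fill = 'skyblue', col = 'black') +
  labs(title = "Histogram of log(amount)", x = "log(amount)", y = "count")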

Clearly, log(amount) offers a better view of the distribution of the variable. This could indicate a potential new feature that could be derived for use in the model.

Now let's look at the distribution of log(amount) by fraud to see whether the distribution differs between fraud transactions and normal transactions.

ggplot(data = transactions, aes(x = log(amount), y = after_stat(density), fill = factor(fraud))) +
  # geom_density(alpha = 0.6, position = 'identity') +
  geom_histogram(alpha = 0.6, position = 'identity') +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  labs(fill = "", title = "Histogram of log(amount) by fraud", x = 'log(amount)', y = "density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can observe that the distribution of amount for fraud transactions is shifted significantly to the right compared with normal transactions, indicating a clear separation between the transaction amounts of fraud and non-fraud cases. Larger transaction amounts are possible warning signals for fraud.

category

Next we look at the distribution of the variable category. It is a categorical variable, so the univariate analysis involves looking at the distribution of its distinct values.

transtype = group_by(transactions, category) %>%  
  summarise(Freq=n(),  fraudcount=sum(fraud), fraudrate=sum(fraud)/n()) %>% as.data.frame() %>% mutate(Perc = Freq/sum(Freq))
print("summary statistics for variable category:")
## [1] "summary statistics for variable category:"
cat("\n")
transtype
##                 category   Freq fraudcount  fraudrate         Perc
## 1  es_barsandrestaurants   6373        120 0.01882944 0.0107182921
## 2            es_contents    885          0 0.00000000 0.0014884181
## 3             es_fashion   6454        116 0.01797335 0.0108545202
## 4                es_food  26250          0 0.00000000 0.0441479942
## 5              es_health  16133       1696 0.10512614 0.0271329368
## 6                es_home   1986        302 0.15206445 0.0033401111
## 7       es_hotelservices   1744        548 0.31422018 0.0029331086
## 8               es_hyper   6098        280 0.04591669 0.0102557893
## 9             es_leisure    499        474 0.94989980 0.0008392323
## 10      es_otherservices    912        228 0.25000000 0.0015338275
## 11      es_sportsandtoys   4002       1982 0.49525237 0.0067306771
## 12               es_tech   2370        158 0.06666667 0.0039859332
## 13     es_transportation 505071          0 0.00000000 0.8494427262
## 14             es_travel    728        578 0.79395604 0.0012243710
## 15  es_wellnessandbeauty  15086        718 0.04759380 0.0253720625

The count of observations for each value of this attribute can be visualised with a bar chart, generated using the geom_bar() function from the ggplot2 package.

The bar chart is supplemented by a line graph of the percentage of fraud cases in each category.

ggplot(transtype) +
  geom_bar(aes(x = category, y = Perc), col = "black", stat = "identity") +
  geom_line(aes(x = category, y = fraudrate, group = 1), color="red", stat = "identity")+
  labs(x = "Transaction category", y = "Percentage", title = "Distribution/Fraudrate by category") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(sec.axis=sec_axis(~.,name="%Frauds")) 

The graph shows that es_leisure, es_sportsandtoys, and es_travel have a high prevalence of fraud cases.

We observe that ‘es_transportation’ accounts for the majority of observations yet has no fraud cases. To get a view of the distribution of the remaining categories, we can redraw the plot after removing this category from the data.

ggplot(subset(transtype, category!="es_transportation")) +
  geom_bar(aes(x = category, y = Perc), col = "black", stat = "identity") +
  geom_line(aes(x = category, y = 0.04*fraudrate, group = 1), color="red", stat = "identity")+
  labs(x = "Transaction category", y = "Percentage", title = "Distribution/Fraudrate by category (excl es_transportation)") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(sec.axis=sec_axis(~./0.04,name="%Frauds")) 

There are more credit card transactions in the categories es_food, es_health and es_wellnessandbeauty, but these categories also have a lower prevalence of fraud. There are very few credit card transactions in the categories with a high prevalence of fraud: es_leisure, es_sportsandtoys, and es_travel.

3. Bivariate Analysis

In bivariate analysis, we look at the strength of the relationships between variables. We examine the relationship between the numeric variables using a correlation matrix.

amount vs step

The correlation between the variables amount and step is negligible, as the matrix below shows.
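The producing chunk is omitted in the original; a standard cor() call over the two numeric columns yields the output below:

# correlation matrix of the two numeric variables (reconstructed chunk)
cor(transactions[, c("step", "amount")])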

##                step       amount
## step    1.000000000 -0.007959275
## amount -0.007959275  1.000000000

The two variables have very low correlation with each other, so they are not expected to interfere with each other during the modelling stage. Similarly, we can design a bivariate analysis for the categorical variables.
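For a pair of categorical variables, one standard association check is a chi-squared test of independence; a minimal sketch (an aside, not part of the original analysis):

# Hypothetical aside: chi-squared test of independence between category and age
chisq.test(table(transactions$category, transactions$age))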

category vs age

Since correlation is only meaningful for continuous variables, we instead look at the distribution of transaction category by age.

cat_by_age = table(transactions$category, transactions$age) %>% 
                    as.data.frame() %>% 
                    rename("category"="Var1", "age"="Var2") %>% 
                   # group_by(IsFraud) %>% 
                    mutate(Perc = Freq/sum(Freq) * 100) 

cat_by_age[c("category", "age", "Perc")] %>% 
        pivot_wider(names_from = age, values_from = Perc, names_prefix = "age =")
## # A tibble: 15 × 9
##    category       `age =0` `age =1` `age =2` `age =3` `age =4` `age =5` `age =6`
##    <fct>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 es_barsandres… 0.00538   0.102     0.352    0.250    0.194   0.114    0.0516 
##  2 es_contents    0.000673  0.0167    0.0474   0.0395   0.0244  0.0141   0.00538
##  3 es_fashion     0.00639   0.103     0.332    0.281    0.204   0.113    0.0432 
##  4 es_food        0.0207    0.435     1.38     1.13     0.800   0.457    0.175  
##  5 es_health      0.0190    0.282     0.854    0.681    0.497   0.257    0.116  
##  6 es_home        0.00202   0.0330    0.103    0.0817   0.0659  0.0336   0.0148 
##  7 es_hotelservi… 0.000841  0.0299    0.0928   0.0720   0.0537  0.0308   0.0119 
##  8 es_hyper       0.00420   0.0910    0.326    0.244    0.193   0.111    0.0542 
##  9 es_leisure     0.000841  0.00975   0.0227   0.0232   0.0148  0.00908  0.00336
## 10 es_otherservi… 0.00135   0.0150    0.0447   0.0436   0.0298  0.0119   0.00673
## 11 es_sportsandt… 0.00269   0.0634    0.213    0.161    0.136   0.0658   0.0309 
## 12 es_tech        0.00269   0.0404    0.127    0.0994   0.0727  0.0385   0.0167 
## 13 es_transporta… 0.330     8.32     26.7     21.0     15.6     9.01     3.86   
## 14 es_travel      0.00101   0.0109    0.0400   0.0321   0.0230  0.00975  0.00555
## 15 es_wellnessan… 0.0150    0.227     0.819    0.630    0.476   0.255    0.111  
## # ℹ 1 more variable: `age =U` <dbl>

Distribution of Transaction Category vs age

ggplot(data = cat_by_age, aes(x = age, y = category, fill = Perc)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred") +
  labs(x = "age", y = "category", title = "Heatmap")

The majority of spending is on transportation in all age groups, especially groups 2, 3 and 4. Let's remove the transportation records and see if the heatmap reveals any other relationships.

ggplot(data = subset(cat_by_age, category != "es_transportation"), aes(x = age, y = category, fill = Perc)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred") +
  labs(x = "age", y = "category", title = "Heatmap (excl es_transportation)")

From this graph and the table, we infer that across almost all age groups, people spend the majority of their transactions on health, food, and wellness & beauty. Categories with a high frequency of transactions could be potential targets for fraudsters.

category vs amount

Distribution of Transaction Category vs amount

We can see the mean amount and fraud rate by category below. Fraudsters may choose the categories on which people spend more on average. Let's check this hypothesis by computing the average amount transacted in each category.

transactions %>% group_by(category) %>% summarise(MeanAmount=mean(amount), fraudrate=sum(fraud)/n())
## # A tibble: 15 × 3
##    category              MeanAmount fraudrate
##    <chr>                      <dbl>     <dbl>
##  1 es_barsandrestaurants       43.5    0.0188
##  2 es_contents                 44.5    0     
##  3 es_fashion                  65.7    0.0180
##  4 es_food                     37.1    0     
##  5 es_health                  136.     0.105 
##  6 es_home                    166.     0.152 
##  7 es_hotelservices           206.     0.314 
##  8 es_hyper                    46.0    0.0459
##  9 es_leisure                 289.     0.950 
## 10 es_otherservices           136.     0.25  
## 11 es_sportsandtoys           216.     0.495 
## 12 es_tech                    121.     0.0667
## 13 es_transportation           27.0    0     
## 14 es_travel                 2250.     0.794 
## 15 es_wellnessandbeauty        65.5    0.0476

In categories like leisure and travel, which are the ones most targeted by fraudsters, the transaction amounts are also significantly high. This suggests that fraudsters are likely to choose travel-related transactions (e.g. flight/train bookings or tours) for fraud. Travel in particular shows a strikingly high average transaction value.

ggplot(data = transactions) +
  geom_boxplot(aes(x = category, y = amount)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Boxplot category vs amount") +
  ylim(0, 3000)
## Warning: Removed 222 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

We can also look at the distribution of amount by category after excluding es_travel, for a better view of the remaining categories.

ggplot(data = subset(transactions, category != "es_travel")) +
  geom_boxplot(aes(x = category, y = amount)) +
  theme(axis.text.x = element_text(angle = 90)) +
  ylim(0, 1000) +
  labs(title = "Boxplot category vs amount (excl es_travel)")
## Warning: Removed 146 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

After removing the category es_travel, we observe that the majority of transactions have an amount below 500. We observe a few outliers in each category, but even the maximum outlier value is less than 2000.

Other than travel, the next-highest median spending is on leisure, sports and toys, and hotel services, all of which have significantly high rates of fraud. Hence, our hypothesis that fraudsters choose the categories on which people spend more is only partly correct.

age vs amount

Distribution of age vs amount

The red dot is the mean of each age group.

ggplot(data = transactions, aes(x = age, y = amount)) +
  geom_boxplot() +
  labs(title = "Boxplot age vs amount") +
  geom_point(data = transactions %>% group_by(age) %>% summarise(Mean = mean(amount)),
             aes(x = age, y = Mean), color = "red", size = 3) +
  ylim(0, 100)
## Warning: Removed 20555 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The boxplot has been trimmed to show only the range bars and exclude some of the outliers, to enhance visibility. It shows that the median transaction amount is similar across age groups. The interquartile ranges are also similar, as the boxes are comparable in size.

The mean transaction amount is highest for group 0 (<= 18). This age group is also more likely to be a victim of fraud compared to the others, which could indicate that fraudsters target certain age groups more than others.

gender vs amount

Distribution of gender vs amount

The red dot is the mean of each gender.

transactions %>% group_by(gender) %>% summarise(Mean=mean(amount), Median = median(amount))
## # A tibble: 4 × 3
##   gender  Mean Median
##   <chr>  <dbl>  <dbl>
## 1 E       36.7   27.2
## 2 F       39.2   27  
## 3 M       36.3   26.8
## 4 U       31.5   25.2
ggplot(data = transactions, aes(x = gender, y = amount)) +
  geom_boxplot() +
  labs(title = "Boxplot gender vs amount") +
  geom_point(data = transactions %>% group_by(gender) %>% summarise(Mean = mean(amount)),
             aes(x = gender, y = Mean), color = "red", size = 3) +
  ylim(0, 100)
## Warning: Removed 20555 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The boxplot has been trimmed to show only the range bars and exclude some of the outliers, to enhance visibility. It shows that the median transaction amount is similar across groups, and the interquartile ranges are similar as the boxes are comparable in size. The mean and median transaction amounts are highest for females, but not significantly different from males.

The same can be seen with the help of boxplots after a logarithmic transformation, log(amount).

ggplot(data = transactions, aes(x = gender, y = log(amount))) +
  geom_boxplot() +
  labs(title = "Boxplot gender vs log(amount)") +
  geom_point(data = transactions %>% group_by(gender) %>% summarise(Mean = mean(log(amount))),
             aes(x = gender, y = Mean), color = "red", size = 3)

4. Feature Engineering

In this section, we create new attributes by transforming the pre-processed data. These attributes could carry useful information for differentiating fraud transactions, or serve as control variables, and thus improve model performance.

Based on the initial data exploration we performed so far, the following transformations have been identified and implemented below.

  1. Logarithmic transformation of amount.
  2. Modified category variable: levels with low fraud rates (<10%) are grouped into a single level called es_misc. This is expected to allow more risk differentiation.
  3. Modified age variable: the levels {'1','2','3','4','5'} (together with 'U') have similar fraud rates, so they are grouped into a single level to improve the variable's performance.

transactions = transactions %>%
  dplyr::mutate(
    log_amount = log(amount),
    mod_age = case_when(age %in% c('1','2','3','4','5','U') ~ '1-5',
                        TRUE ~ age),
    mod_category = case_when(category %in% c('es_leisure',
                                             'es_hotelservices',
                                             'es_home',
                                             'es_sportsandtoys',
                                             'es_travel',
                                             'es_otherservices',
                                             'es_health') ~ category,
                             TRUE ~ 'es_misc')
  )

5. Fitting a classification model

Sampling data

Create test and train data.

set.seed(123)
split_strat <- initial_split(transactions, prop = 0.8,
                             strata = 'fraud')
train <- training(split_strat)
test <- testing(split_strat)
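As a quick hypothetical check (output not shown in the original), the stratified split should preserve the fraud prevalence in both partitions:

# Hypothetical check: stratification should preserve the ~1.2% fraud prevalence
prop.table(table(train$fraud))
prop.table(table(test$fraud))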

Fit logistic regression model

Use the train data to fit a logistic regression model.

model.log <- glm(fraud ~ log_amount + gender + mod_age + mod_category, data = train, family = binomial(link = 'logit'))
summary(model.log)
## 
## Call:
## glm(formula = fraud ~ log_amount + gender + mod_age + mod_category, 
##     family = binomial(link = "logit"), data = train)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -16.50971    0.61777 -26.725  < 2e-16 ***
## log_amount                     2.58600    0.03055  84.639  < 2e-16 ***
## genderF                        1.20044    0.52980   2.266  0.02346 *  
## genderM                        0.76095    0.53024   1.435  0.15126    
## genderU                      -10.08863   93.92589  -0.107  0.91446    
## mod_age1-5                     0.15720    0.26365   0.596  0.55102    
## mod_age6                      -0.02164    0.28137  -0.077  0.93869    
## mod_categoryes_home            0.20592    0.09852   2.090  0.03660 *  
## mod_categoryes_hotelservices   1.26722    0.08892  14.252  < 2e-16 ***
## mod_categoryes_leisure         4.55096    0.25623  17.761  < 2e-16 ***
## mod_categoryes_misc           -0.46759    0.05752  -8.129 4.31e-16 ***
## mod_categoryes_otherservices   1.73286    0.12311  14.076  < 2e-16 ***
## mod_categoryes_sportsandtoys   2.61282    0.06535  39.979  < 2e-16 ***
## mod_categoryes_travel         -0.65162    0.17027  -3.827  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61830  on 475671  degrees of freedom
## Residual deviance: 21351  on 475658  degrees of freedom
## AIC: 21379
## 
## Number of Fisher Scoring iterations: 15

Let's test the performance of the model on the train and test data separately.

Predictions on train data.

prediction_train = predict(model.log, train, type = 'response')
prediction_train = factor((prediction_train>=0.50)*1)
confusionMatrix(prediction_train, factor(train$fraud), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 469394   2345
##          1    571   3362
##                                           
##                Accuracy : 0.9939          
##                  95% CI : (0.9936, 0.9941)
##     No Information Rate : 0.988           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6945          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.589101        
##             Specificity : 0.998785        
##          Pos Pred Value : 0.854818        
##          Neg Pred Value : 0.995029        
##              Prevalence : 0.011998        
##          Detection Rate : 0.007068        
##    Detection Prevalence : 0.008268        
##       Balanced Accuracy : 0.793943        
##                                           
##        'Positive' Class : 1               
## 

Predictions on test data.

prediction_test = predict(model.log, test, type = 'response')
prediction_test = factor((prediction_test>=0.50)*1)
confusionMatrix(prediction_test, factor(test$fraud), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 117294    634
##          1    132    859
##                                          
##                Accuracy : 0.9936         
##                  95% CI : (0.9931, 0.994)
##     No Information Rate : 0.9874         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6885         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.575352       
##             Specificity : 0.998876       
##          Pos Pred Value : 0.866801       
##          Neg Pred Value : 0.994624       
##              Prevalence : 0.012555       
##          Detection Rate : 0.007223       
##    Detection Prevalence : 0.008333       
##       Balanced Accuracy : 0.787114       
##                                          
##        'Positive' Class : 1              
## 

Using a 50% threshold (i.e., fraud = 1 if the predicted probability is >= 0.50), the sensitivity of the model is only around 57%. The model favors a low false positive rate (few false alarms) at the expense of missing frauds. At a lower threshold, such as 20%, the model may show better sensitivity. This trade-off can also be visualized using the ROC curve.
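As a quick sketch (a hypothetical follow-up; output not shown), the confusion matrix can be recomputed at a 20% threshold:

# Hypothetical follow-up: re-evaluate the logistic model at a 20% threshold
prediction_test_20 = factor((predict(model.log, test, type = 'response') >= 0.20)*1)
confusionMatrix(prediction_test_20, factor(test$fraud), positive = '1')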

predicted_probs = predict(model.log, test, type = 'response')
roc_obj <- roc(test$fraud, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_obj)  # Output: AUC value
## Area under the curve: 0.959
plot(1-roc_obj$specificities, roc_obj$sensitivities, main = paste0("AUC = ", round(auc(roc_obj), 3)), xlab = 'false positive rate', ylab = 'True positive rate')
abline(a = 0, b = 1, lty = 2, col = "gray")  

Fit Naive Bayes model

Use the train data to fit a Naive Bayes classifier.

model.nb <- naive_bayes(factor(fraud) ~ log_amount + gender + mod_age + mod_category, data = train, usekernel = T, laplace = 10)
summary(model.nb)
## 
## ================================= Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = factor(fraud) ~ log_amount + gender +      mod_age + mod_category, data = train, laplace = 10, usekernel = T) 
## - Laplace: 10 
## - Classes: 2 
## - Samples: 475672 
## - Features: 4 
## - Conditional distributions: 
##     - Categorical: 3
##     - KDE: 1
## - Prior probabilities: 
##     - 0: 0.988
##     - 1: 0.012
## 
## --------------------------------------------------------------------------------

Predictions on train data.

prediction_train = predict(model.nb, train, type = 'class')
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(prediction_train, factor(train$fraud), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 464864   1503
##          1   5101   4204
##                                           
##                Accuracy : 0.9861          
##                  95% CI : (0.9858, 0.9864)
##     No Information Rate : 0.988           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5534          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.736639        
##             Specificity : 0.989146        
##          Pos Pred Value : 0.451800        
##          Neg Pred Value : 0.996777        
##              Prevalence : 0.011998        
##          Detection Rate : 0.008838        
##    Detection Prevalence : 0.019562        
##       Balanced Accuracy : 0.862893        
##                                           
##        'Positive' Class : 1               
## 

Predictions on test data.

prediction_test = predict(model.nb, test, type = 'class')
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(prediction_test, factor(test$fraud), positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 116137    415
##          1   1289   1078
##                                          
##                Accuracy : 0.9857         
##                  95% CI : (0.985, 0.9863)
##     No Information Rate : 0.9874         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.5516         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.722036       
##             Specificity : 0.989023       
##          Pos Pred Value : 0.455429       
##          Neg Pred Value : 0.996439       
##              Prevalence : 0.012555       
##          Detection Rate : 0.009065       
##    Detection Prevalence : 0.019904       
##       Balanced Accuracy : 0.855530       
##                                          
##        'Positive' Class : 1              
## 

The model has a sensitivity of about 72% on the test data, meaning 72% of fraud cases can be detected using the Naive Bayes algorithm. It also has a specificity of 99%, which is also good; specificity indicates how well the model identifies non-fraud cases.

Now let's take a look at the AUC value and ROC curve.

predicted_probs = predict(model.nb, test, type = 'prob')[,"1"]
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
roc_obj <- roc(test$fraud, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_obj)  # Output: AUC value
## Area under the curve: 0.9771
plot(1-roc_obj$specificities, roc_obj$sensitivities, main = paste0("AUC = ", round(auc(roc_obj), 3)), xlab = 'false positive rate', ylab = 'True positive rate')
abline(a = 0, b = 1, lty = 2, col = "gray")  

6. Conclusion

Of the models fitted, Naive Bayes performs noticeably better, displaying a sensitivity of about 72% versus roughly 57% for logistic regression at the same 50% threshold, along with a higher AUC (0.977 vs 0.959). This is a decent result considering the class-imbalance problem we observed.