This synthetically generated dataset consists of payments made by various customers over different time periods and in different amounts. For more information on the dataset, see its Kaggle page:
https://www.kaggle.com/code/turkayavci/fraud-detection-on-bank-payments/notebook
Data
As we can see in the first rows below, the dataset has 9 feature columns and a target column. The feature columns are:
Step: This feature represents the day from the start of the simulation. It has 180 steps, so the simulation ran for virtually 6 months.
Customer: This feature represents the customer id.
zipCodeOrigin: The zip code of origin/source.
Merchant: The merchant’s id.
zipMerchant: The merchant’s zip code.
Age: Categorized age
0: <= 18,
1: 19-25,
2: 26-35,
3: 36-45,
4: 46-55,
5: 56-65,
6: > 65,
U: Unknown
Gender: Gender of the customer
E: Enterprise,
F: Female,
M: Male,
U: Unknown
Category: Category of the purchase. I won’t list all the categories here; we’ll see them later in the analysis.
Amount: Amount of the purchase.
Fraud: Target variable which shows whether the transaction is fraudulent (1) or benign (0).
library(dplyr)        # data manipulation
library(tidyr)        # pivot_wider() for reshaping
library(ggplot2)      # plotting (also attached via caret; loaded explicitly here)
library(corrplot)     # correlation plots
library(caret)        # confusionMatrix() and modelling utilities
library(rsample)      # stratified train/test split
#library(rpart)       # for decision tree
#library(rpart.plot)  # for plotting decision tree
#library(rattle)      # for plotting decision tree
#library(e1071)       # for Naive Bayes classifier
library(naivebayes)   # naive_bayes() classifier
#library(randomForest)
library(pROC)         # ROC curves and AUC
transactions = read.csv(paste0(getwd(),"/bs140513_032310.csv"))
head(transactions)
## step customer age gender zipcodeOri merchant zipMerchant
## 1 0 'C1093826151' '4' 'M' '28007' 'M348934600' '28007'
## 2 0 'C352968107' '2' 'M' '28007' 'M348934600' '28007'
## 3 0 'C2054744914' '4' 'F' '28007' 'M1823072687' '28007'
## 4 0 'C1760612790' '3' 'M' '28007' 'M348934600' '28007'
## 5 0 'C757503768' '5' 'M' '28007' 'M348934600' '28007'
## 6 0 'C1315400589' '3' 'F' '28007' 'M348934600' '28007'
## category amount fraud
## 1 'es_transportation' 4.55 0
## 2 'es_transportation' 39.68 0
## 3 'es_transportation' 26.89 0
## 4 'es_transportation' 17.25 0
## 5 'es_transportation' 35.72 0
## 6 'es_transportation' 25.81 0
Now let’s look at the structure of the imported data.
str(transactions)
## 'data.frame': 594643 obs. of 10 variables:
## $ step : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer : chr "'C1093826151'" "'C352968107'" "'C2054744914'" "'C1760612790'" ...
## $ age : chr "'4'" "'2'" "'4'" "'3'" ...
## $ gender : chr "'M'" "'M'" "'F'" "'M'" ...
## $ zipcodeOri : chr "'28007'" "'28007'" "'28007'" "'28007'" ...
## $ merchant : chr "'M348934600'" "'M348934600'" "'M1823072687'" "'M348934600'" ...
## $ zipMerchant: chr "'28007'" "'28007'" "'28007'" "'28007'" ...
## $ category : chr "'es_transportation'" "'es_transportation'" "'es_transportation'" "'es_transportation'" ...
## $ amount : num 4.55 39.68 26.89 17.25 35.72 ...
## $ fraud : int 0 0 0 0 0 0 0 0 0 0 ...
We notice that the character fields are wrapped in single quotes ('). Let’s remove the ' character around each entry in the data.
charcols = sapply(transactions,class) == "character"
print(colnames(transactions[charcols]))
## [1] "customer" "age" "gender" "zipcodeOri" "merchant"
## [6] "zipMerchant" "category"
transactions[charcols] <- lapply(transactions[charcols], gsub, pattern="'", replacement='')
head(transactions)
## step customer age gender zipcodeOri merchant zipMerchant
## 1 0 C1093826151 4 M 28007 M348934600 28007
## 2 0 C352968107 2 M 28007 M348934600 28007
## 3 0 C2054744914 4 F 28007 M1823072687 28007
## 4 0 C1760612790 3 M 28007 M348934600 28007
## 5 0 C757503768 5 M 28007 M348934600 28007
## 6 0 C1315400589 3 F 28007 M348934600 28007
## category amount fraud
## 1 es_transportation 4.55 0
## 2 es_transportation 39.68 0
## 3 es_transportation 26.89 0
## 4 es_transportation 17.25 0
## 5 es_transportation 35.72 0
## 6 es_transportation 25.81 0
Verify the structure of the data…
str(transactions)
## 'data.frame': 594643 obs. of 10 variables:
## $ step : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer : chr "C1093826151" "C352968107" "C2054744914" "C1760612790" ...
## $ age : chr "4" "2" "4" "3" ...
## $ gender : chr "M" "M" "F" "M" ...
## $ zipcodeOri : chr "28007" "28007" "28007" "28007" ...
## $ merchant : chr "M348934600" "M348934600" "M1823072687" "M348934600" ...
## $ zipMerchant: chr "28007" "28007" "28007" "28007" ...
## $ category : chr "es_transportation" "es_transportation" "es_transportation" "es_transportation" ...
## $ amount : num 4.55 39.68 26.89 17.25 35.72 ...
## $ fraud : int 0 0 0 0 0 0 0 0 0 0 ...
The values of zipcodeOri and zipMerchant are the same throughout the data, so we might as well drop them.
Unique values of zipcodeOri
unique(transactions$zipcodeOri)
## [1] "28007"
Unique values of zipMerchant
unique(transactions$zipMerchant)
## [1] "28007"
Since there is only one value each for zipMerchant and zipcodeOri, we can remove these fields.
transactions = transactions %>% dplyr::select(-c(zipMerchant, zipcodeOri))
Now let’s check the data for missing (NA) values in any attribute.
colSums(is.na(transactions))
## step customer age gender merchant category amount fraud
## 0 0 0 0 0 0 0 0
No missing observations are present in the data. Before moving to the next step, let’s look at the size of the data.
dim(transactions)
## [1] 594643 8
In this section, we analyse each variable individually to gain insight from it. We explore the distribution of each variable as well as its relationship with the fraud rate.
‘fraud’ is a binary indicator denoting whether a transaction is fraudulent. We expect an overwhelming majority of the transactions to be non-fraudulent.
print("summary statistics for variable Fraud:")
## [1] "summary statistics for variable Fraud:"
cat("\n")
summary(as.factor(transactions$fraud))
## 0 1
## 587443 7200
transtype = table(transactions$fraud) %>% as.data.frame() %>% rename("fraud"="Var1")
ggplot(transtype, aes(x = fraud, y = Freq, fill = fraud)) +
geom_bar(stat = "identity", color = "black") +
labs(x = "fraud", y = "Count", title = "Bar Chart of Distribution by fraud") +
theme_minimal()
As the plot above and the instance counts show, fraud data is heavily imbalanced. To balance the dataset one could oversample the minority class or undersample the majority class, although no such method has been adopted for the purpose of this modelling exercise.
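For illustration only, a minimal sketch of random undersampling of the majority class (not applied anywhere in this analysis):
# Illustrative sketch -- not used in the rest of this analysis.
# Randomly undersample the non-fraud class to match the fraud count.
set.seed(42)
fraud_rows    <- which(transactions$fraud == 1)
nonfraud_rows <- which(transactions$fraud == 0)
balanced      <- transactions[c(fraud_rows,
                                sample(nonfraud_rows, length(fraud_rows))), ]
table(balanced$fraud)  # both classes now equally represented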
Unique values of gender are
unique(transactions$gender)
## [1] "M" "F" "E" "U"
Let’s look at the distribution across each level of gender, including the fraud rate. This helps us understand whether fraudsters target a specific gender.
gender = group_by(transactions, gender) %>% summarise(Freq=n(), fraudcount = sum(fraud), fraudrate=sum(fraud)/n()) %>% as.data.frame() %>% mutate(Perc = Freq/sum(Freq))
print("Distribution for variable gender:")
## [1] "Distribution for variable gender:"
cat("\n")
gender
## gender Freq fraudcount fraudrate Perc
## 1 E 1178 7 0.005942275 0.0019810205
## 2 F 324565 4758 0.014659621 0.5458148839
## 3 M 268385 2435 0.009072787 0.4513380297
## 4 U 515 0 0.000000000 0.0008660659
From the data, the number of fraud cases for females is roughly twice that for males, but females also performed more banking transactions overall. Looking at the fraud rate (count of frauds over total count of transactions), the likelihood of a female customer being a victim of a fraudulent transaction is still higher than for males.
Let’s visualize this using a bar chart and a line graph.
ggplot(gender, aes(x = gender, y = Perc)) +
geom_bar(stat = "identity", col = "black", aes(fill=gender)) +
geom_line(aes(x = gender, y = 40*fraudrate, group = 1), color="red", stat = "identity")+
labs(x = "gender", y = "Count", title = "Distribution/fraud by gender") +
scale_y_continuous(sec.axis=sec_axis(~./40*100,name="%Frauds"))
In conclusion, the majority of transactions are made by males and females, with females showing a higher likelihood of fraud than males. Enterprise transactions have the third-highest prevalence of fraud among the gender levels.
First let’s look at the unique values of age.
unique(transactions$age)
## [1] "4" "2" "3" "5" "1" "6" "U" "0"
Next we look at the distribution of the variable age. Age is a categorical variable, so the univariate analysis involves looking at the distribution of the distinct values and the trend of % fraud across them.
age = group_by(transactions, age) %>% summarise(Freq=n(), fraudcount=sum(fraud), fraudrate=sum(fraud)/n()) %>% as.data.frame() %>% mutate(Perc = Freq/sum(Freq))
print("summary statistics for variable age:")
## [1] "summary statistics for variable age:"
cat("\n")
age
## age Freq fraudcount fraudrate Perc
## 1 0 2452 48 0.019575856 0.004123482
## 2 1 58131 689 0.011852540 0.097757814
## 3 2 187310 2344 0.012514014 0.314995720
## 4 3 147131 1755 0.011928146 0.247427448
## 5 4 109025 1410 0.012932814 0.183345301
## 6 5 62642 686 0.010951119 0.105343879
## 7 6 26774 261 0.009748263 0.045025335
## 8 U 1178 7 0.005942275 0.001981021
Let’s visualize this table to understand it further.
ggplot(age) +
geom_bar(aes(x = age, y = Perc), stat = "identity", col = "black", fill='skyblue') +
geom_line(aes(x = age, y = 10*fraudrate, group = 1), color="red", stat = "identity")+
labs(x = "age", y = "Count", title = "Distribution/Fraudrate by age") +
scale_y_continuous(sec.axis=sec_axis(~./10,name="%Frauds"))
The majority of transactions are made by people in age groups 2, 3, and 4, which represent ages 26-55. Age group 0 has the highest risk of fraud, at around 2%. The percentage of frauds is almost constant across groups 1 to 4, then falls slightly in groups 5 and 6. This could mean that older people are less likely to be victims of fraud, whereas very young people (<18) are the most likely.
Before continuing the univariate analysis, we look for cases where the transaction amount is not positive, as these could be potential data issues.
sum(transactions$amount==0)
## [1] 52
There are 52 such records, so we remove these observations as well.
transactions = transactions %>% dplyr::filter(amount>0)
Now let’s look at the summary statistics of the amount variable.
print("summary statistics for variable amount:")
## [1] "summary statistics for variable amount:"
cat("\n")
summary(transactions$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 13.74 26.90 37.89 42.54 8329.96
The summary statistics indicate that the distribution of amount could be right skewed, since the mean exceeds the median. Now let’s visualize the distribution of the variable using a histogram.
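The plot itself is not reproduced here; a sketch of code that would produce the histogram described (the bin count and the right cut-off of 1000 are assumptions):
# Sketch: histogram of amount, right-censored to keep the bulk visible.
ggplot(transactions, aes(x = amount)) +
  geom_histogram(bins = 100, fill = "skyblue", col = "black") +
  xlim(0, 1000) +
  labs(title = "Histogram of amount (right-censored)", x = "amount", y = "count")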
From the histogram, it can be observed that amount is heavily right skewed. Note that the plot is right-censored to accommodate the extremely large values.
Log transformation of amount
To get a better view of the data, we can study the distribution of log(amount), as the logarithmic transformation makes right-skewed data more symmetric.
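A sketch of the corresponding plot (the bin count is an assumption):
# Sketch: histogram of log(amount); zero-amount rows were already removed.
ggplot(transactions, aes(x = log(amount))) +
  geom_histogram(bins = 50, fill = "skyblue", col = "black") +
  labs(title = "Histogram of log(amount)", x = "log(amount)", y = "count")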
Clearly, log(amount) offers a better view of the
distribution of the variable. This could indicate a potential new
feature that could be derived to be used in the model.
Now let’s look at the distribution of log(amount) by fraud, to see whether the distribution differs between fraudulent and normal transactions.
ggplot(data = transactions,aes(x = log(amount), y=after_stat(density), fill = factor(fraud))) +
# geom_density(alpha=0.6, position = 'identity') +
geom_histogram(alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
labs(fill="", title=paste0("Histogram of log(amount) by fraud"), x='log(amount)', y = "density")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can observe that the distribution of amount for fraudulent transactions is shifted noticeably to the right compared to normal transactions, indicating a differentiation between the transaction amounts of fraud and non-fraud cases. Larger transaction amounts are possible warning signals of fraud.
Next we look at the distribution of the variable category (the transaction type). It is a categorical variable, so the univariate analysis involves looking at the distribution of its distinct values.
transtype = group_by(transactions, category) %>%
summarise(Freq=n(), fraudcount=sum(fraud), fraudrate=sum(fraud)/n()) %>% as.data.frame() %>% mutate(Perc = Freq/sum(Freq))
print("summary statistics for variable category:")
## [1] "summary statistics for variable category:"
cat("\n")
transtype
## category Freq fraudcount fraudrate Perc
## 1 es_barsandrestaurants 6373 120 0.01882944 0.0107182921
## 2 es_contents 885 0 0.00000000 0.0014884181
## 3 es_fashion 6454 116 0.01797335 0.0108545202
## 4 es_food 26250 0 0.00000000 0.0441479942
## 5 es_health 16133 1696 0.10512614 0.0271329368
## 6 es_home 1986 302 0.15206445 0.0033401111
## 7 es_hotelservices 1744 548 0.31422018 0.0029331086
## 8 es_hyper 6098 280 0.04591669 0.0102557893
## 9 es_leisure 499 474 0.94989980 0.0008392323
## 10 es_otherservices 912 228 0.25000000 0.0015338275
## 11 es_sportsandtoys 4002 1982 0.49525237 0.0067306771
## 12 es_tech 2370 158 0.06666667 0.0039859332
## 13 es_transportation 505071 0 0.00000000 0.8494427262
## 14 es_travel 728 578 0.79395604 0.0012243710
## 15 es_wellnessandbeauty 15086 718 0.04759380 0.0253720625
The count of observations at each value of this attribute can be visualised with a bar chart, generated using the geom_bar() function from the ggplot2 package.
The bar chart is supplemented by a line graph of the % of fraud cases in each category.
ggplot(transtype) +
geom_bar(aes(x = category, y = Perc), col = "black", stat = "identity") +
geom_line(aes(x = category, y = fraudrate, group = 1), color="red", stat = "identity")+
labs(x = "Transaction category", y = "Percentage", title = "Distribution/Fraudrate by category") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(sec.axis=sec_axis(~.,name="%Frauds"))
The graph shows that es_leisure, es_sportsandtoys, and es_travel have a high prevalence of fraud cases.
We observe that ‘es_transportation’ accounts for the majority of observations yet has no fraud cases. To get a view of the distribution of the remaining categories, we can redraw the bar plot after removing this category from the data.
ggplot(subset(transtype, category!="es_transportation")) +
geom_bar(aes(x = category, y = Perc), col = "black", stat = "identity") +
geom_line(aes(x = category, y = 0.04*fraudrate, group = 1), color="red", stat = "identity")+
labs(x = "Transaction category", y = "Percentage", title = "Distribution/Fraudrate by category (excl es_transportation)") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(sec.axis=sec_axis(~./0.04,name="%Frauds"))
There are more credit card transactions in the categories es_food, es_health and es_wellnessandbeauty, but these categories also have a lower prevalence of fraudulent transactions. Conversely, there are very few credit card transactions in the categories with a high prevalence of fraud: es_leisure, es_sportsandtoys, and es_travel.
In bivariate analysis we look at the strength of the relationships between variables. We examine the relationship between the numeric variables using a correlation matrix; the correlation between amount and step turns out to be negligible.
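The code that produced the matrix below is not shown in the original; it was presumably along these lines:
cor(transactions[, c("step", "amount")])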
## step amount
## step 1.000000000 -0.007959275
## amount -0.007959275 1.000000000
The two variables have very low correlation with each other, so they are not expected to interfere with each other during the modelling stage. Since correlation is only meaningful for continuous variables, for the categorical variables we instead examine the distribution of transaction category by age.
cat_by_age = table(transactions$category, transactions$age) %>%
as.data.frame() %>%
rename("category"="Var1", "age"="Var2") %>%
# group_by(IsFraud) %>%
mutate(Perc = Freq/sum(Freq) * 100)
cat_by_age[c("category", "age", "Perc")] %>%
pivot_wider(names_from = age, values_from = Perc, names_prefix = "age =")
## # A tibble: 15 × 9
## category `age =0` `age =1` `age =2` `age =3` `age =4` `age =5` `age =6`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 es_barsandres… 0.00538 0.102 0.352 0.250 0.194 0.114 0.0516
## 2 es_contents 0.000673 0.0167 0.0474 0.0395 0.0244 0.0141 0.00538
## 3 es_fashion 0.00639 0.103 0.332 0.281 0.204 0.113 0.0432
## 4 es_food 0.0207 0.435 1.38 1.13 0.800 0.457 0.175
## 5 es_health 0.0190 0.282 0.854 0.681 0.497 0.257 0.116
## 6 es_home 0.00202 0.0330 0.103 0.0817 0.0659 0.0336 0.0148
## 7 es_hotelservi… 0.000841 0.0299 0.0928 0.0720 0.0537 0.0308 0.0119
## 8 es_hyper 0.00420 0.0910 0.326 0.244 0.193 0.111 0.0542
## 9 es_leisure 0.000841 0.00975 0.0227 0.0232 0.0148 0.00908 0.00336
## 10 es_otherservi… 0.00135 0.0150 0.0447 0.0436 0.0298 0.0119 0.00673
## 11 es_sportsandt… 0.00269 0.0634 0.213 0.161 0.136 0.0658 0.0309
## 12 es_tech 0.00269 0.0404 0.127 0.0994 0.0727 0.0385 0.0167
## 13 es_transporta… 0.330 8.32 26.7 21.0 15.6 9.01 3.86
## 14 es_travel 0.00101 0.0109 0.0400 0.0321 0.0230 0.00975 0.00555
## 15 es_wellnessan… 0.0150 0.227 0.819 0.630 0.476 0.255 0.111
## # ℹ 1 more variable: `age =U` <dbl>
Distribution of Transaction Category vs age
ggplot(data = cat_by_age, aes(x = age, y = category, fill = Perc)) +
geom_tile()+
scale_fill_gradient(low = "white", high = "darkred")+
labs(x = "age", y = "category", title = "Heatmap")
The majority of spending is on transportation in all age groups, especially groups 2, 3 and 4. Let’s remove the transportation records and see if the heat map reveals any other relationships.
ggplot(data = subset(cat_by_age, category!="es_transportation"), aes(x = age, y = category, fill = Perc)) +
geom_tile()+
scale_fill_gradient(low = "white", high = "darkred")+
labs(x = "category", y = "age", title = "Heatmap (excl es_transportation)")
From this graph and table, we infer that for almost all age groups, most transactions are for health, food, and wellness & beauty. Categories with a high frequency of transactions could be potential targets for fraudsters.
Distribution of Transaction Category vs amount
We can see the mean amount and fraud rate by category below. Fraudsters may choose the categories on which people spend more on average. Let’s check this hypothesis by computing the average amount transacted in each category.
transactions %>% group_by(category) %>% summarise(MeanAmount=mean(amount), fraudrate=sum(fraud)/n())
## # A tibble: 15 × 3
## category MeanAmount fraudrate
## <chr> <dbl> <dbl>
## 1 es_barsandrestaurants 43.5 0.0188
## 2 es_contents 44.5 0
## 3 es_fashion 65.7 0.0180
## 4 es_food 37.1 0
## 5 es_health 136. 0.105
## 6 es_home 166. 0.152
## 7 es_hotelservices 206. 0.314
## 8 es_hyper 46.0 0.0459
## 9 es_leisure 289. 0.950
## 10 es_otherservices 136. 0.25
## 11 es_sportsandtoys 216. 0.495
## 12 es_tech 121. 0.0667
## 13 es_transportation 27.0 0
## 14 es_travel 2250. 0.794
## 15 es_wellnessandbeauty 65.5 0.0476
In categories like leisure and travel, which are the categories most targeted by fraudsters, the transaction amount is also significantly high. This suggests that fraudsters are likely to choose travel-related transactions (for example flight/train bookings or tours) for fraud; travel in particular shows a markedly high average transaction value.
ggplot(data = transactions) +
geom_boxplot(aes(x = category, y = amount))+
theme(axis.text.x = element_text(angle = 90))+
labs(title = "Boxplot category vs amount") +
ylim(0, 3000)
## Warning: Removed 222 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
We can also look at the distribution of amount by category after excluding es_travel, for a better view of the remaining categories.
ggplot(data = subset(transactions, category!="es_travel")) +
geom_boxplot(aes(x = category, y = amount))+
theme(axis.text.x = element_text(angle = 90))+
ylim(0, 1000) +
labs(title = "Boxplot category vs amount (excl es_travel)")
## Warning: Removed 146 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
After removing the category es_travel, we observe that the majority of transactions have amounts within 500. We observe a few outliers in each category, but even the largest outliers are below 2000.
Other than travel, the next-highest median spending is for leisure, sports and toys, and hotel services, all of which have significantly high rates of fraud. Hence, our hypothesis that fraudsters choose the categories on which people spend more is only partly correct.
Distribution of age vs amount
The red dot is the mean of each age group.
ggplot(data = transactions, aes(x = age, y = amount)) +
geom_boxplot()+
labs(title = "Boxplot age vs amount") +
geom_point(data = transactions %>% group_by(age) %>% summarise(Mean=mean(amount)),
aes(x = age, y = Mean), color = "red", size = 3) +
ylim(0, 100)
## Warning: Removed 20555 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot has been trimmed to show only the range bars and exclude some of the outliers, to enhance visibility. It shows that the median transaction amount is similar across groups. The interquartile ranges are also similar, as the boxes are comparable in size.
The mean transaction amount is highest for group 0 (<18). This age group is also more likely to be a victim of fraud than the others, which could indicate that fraudsters target certain age groups more than others.
Distribution of gender vs amount
The red dot is the mean of each gender.
transactions %>% group_by(gender) %>% summarise(Mean=mean(amount), Median = median(amount))
## # A tibble: 4 × 3
## gender Mean Median
## <chr> <dbl> <dbl>
## 1 E 36.7 27.2
## 2 F 39.2 27
## 3 M 36.3 26.8
## 4 U 31.5 25.2
ggplot(data = transactions, aes(x = gender, y = amount)) +
geom_boxplot()+
labs(title = "Boxplot gender vs amount") +
geom_point(data = transactions %>% group_by(gender) %>% summarise(Mean=mean(amount)),
aes(x = gender, y = Mean), color = "red", size = 3) +
ylim(0,100)
## Warning: Removed 20555 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot has been trimmed to show only the range bars and exclude some of the outliers, to enhance visibility. It shows that the median transaction amount is similar across groups, and the interquartile ranges are also similar, as the boxes are comparable in size. The mean and median transaction amounts are highest for females, but not significantly different from males.
The same can be seen with boxplots after a logarithmic transformation, log(amount).
ggplot(data = transactions, aes(x = gender, y = log(amount))) +
geom_boxplot()+
labs(title = "Boxplot gender vs log(amount)") +
geom_point(data = transactions %>% group_by(gender) %>% summarise(Mean=mean(log(amount))),
aes(x = gender, y = Mean), color = "red", size = 3)
In this section, we create new attributes by transforming the pre-processed data. These attributes could carry useful information for differentiating fraudulent transactions, or serve as control variables, thus improving the performance of the model.
Based on the data exploration performed so far, the following transformations have been identified and are implemented below.
log_amount: the logarithmic transformation of amount, which we saw gives a more symmetric distribution than the raw amount.
mod_age: the age levels {'1','2','3','4','5'} (together with 'U') have similar fraud rates, hence they are grouped into a single level to improve the variable's performance.
mod_category: the high-fraud categories are kept as-is, while the remaining low-fraud categories are collapsed into a single level es_misc. This is expected to allow more risk differentiation.
transactions = transactions %>%
  dplyr::mutate(
    log_amount = log(amount),
    mod_age = case_when(age %in% c('1','2','3','4','5','U') ~ '1-5',
                        TRUE ~ age
    ),
    mod_category = case_when(category %in% c('es_leisure',
                                             'es_hotelservices',
                                             'es_home',
                                             'es_sportsandtoys',
                                             'es_travel',
                                             'es_otherservices',
                                             'es_health') ~ category,
                             TRUE ~ 'es_misc'
    )
  )
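As a quick sanity check (illustrative, not part of the original flow), we can verify that the engineered category levels separate fraud risk as intended:
# Illustrative check: fraud rate per engineered category level.
transactions %>%
  group_by(mod_category) %>%
  summarise(fraudrate = mean(fraud)) %>%
  arrange(desc(fraudrate))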
Create the train and test data using a split stratified on fraud.
set.seed(123)
split_strat <- initial_split(transactions, prop = 0.8,
strata = 'fraud')
train <- training(split_strat)
test <- testing(split_strat)
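A stratified split should preserve the fraud rate in both partitions; a quick illustrative check (not in the original):
# Illustrative: the fraud proportion should be close in both sets.
prop.table(table(train$fraud))
prop.table(table(test$fraud))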
Use the train data to fit a logistic regression model.
model.log <- glm(fraud ~ log_amount + gender + mod_age + mod_category, data = train, family = binomial(link = 'logit'))
summary(model.log)
##
## Call:
## glm(formula = fraud ~ log_amount + gender + mod_age + mod_category,
## family = binomial(link = "logit"), data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.50971 0.61777 -26.725 < 2e-16 ***
## log_amount 2.58600 0.03055 84.639 < 2e-16 ***
## genderF 1.20044 0.52980 2.266 0.02346 *
## genderM 0.76095 0.53024 1.435 0.15126
## genderU -10.08863 93.92589 -0.107 0.91446
## mod_age1-5 0.15720 0.26365 0.596 0.55102
## mod_age6 -0.02164 0.28137 -0.077 0.93869
## mod_categoryes_home 0.20592 0.09852 2.090 0.03660 *
## mod_categoryes_hotelservices 1.26722 0.08892 14.252 < 2e-16 ***
## mod_categoryes_leisure 4.55096 0.25623 17.761 < 2e-16 ***
## mod_categoryes_misc -0.46759 0.05752 -8.129 4.31e-16 ***
## mod_categoryes_otherservices 1.73286 0.12311 14.076 < 2e-16 ***
## mod_categoryes_sportsandtoys 2.61282 0.06535 39.979 < 2e-16 ***
## mod_categoryes_travel -0.65162 0.17027 -3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61830 on 475671 degrees of freedom
## Residual deviance: 21351 on 475658 degrees of freedom
## AIC: 21379
##
## Number of Fisher Scoring iterations: 15
Let’s test the performance of the model on the train and test data separately.
Predictions on train data.
prediction_train = predict(model.log, train, type = 'response')
prediction_train = factor((prediction_train>=0.50)*1)
confusionMatrix(prediction_train, factor(train$fraud), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 469394 2345
## 1 571 3362
##
## Accuracy : 0.9939
## 95% CI : (0.9936, 0.9941)
## No Information Rate : 0.988
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6945
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.589101
## Specificity : 0.998785
## Pos Pred Value : 0.854818
## Neg Pred Value : 0.995029
## Prevalence : 0.011998
## Detection Rate : 0.007068
## Detection Prevalence : 0.008268
## Balanced Accuracy : 0.793943
##
## 'Positive' Class : 1
##
Predictions on test data.
prediction_test = predict(model.log, test, type = 'response')
prediction_test = factor((prediction_test>=0.50)*1)
confusionMatrix(prediction_test, factor(test$fraud), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117294 634
## 1 132 859
##
## Accuracy : 0.9936
## 95% CI : (0.9931, 0.994)
## No Information Rate : 0.9874
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6885
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.575352
## Specificity : 0.998876
## Pos Pred Value : 0.866801
## Neg Pred Value : 0.994624
## Prevalence : 0.012555
## Detection Rate : 0.007223
## Detection Prevalence : 0.008333
## Balanced Accuracy : 0.787114
##
## 'Positive' Class : 1
##
Using a 50% threshold (i.e., fraud = 1 if the predicted probability >= 0.50), the sensitivity of the model is only around 57%. The model favours a low false positive rate (few false alarms) at the expense of missing frauds. At an even lower threshold such as 20%, the model may start to show better sensitivity; a sketch follows, and the trade-off can also be visualized using a ROC curve.
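A sketch of re-thresholding at 0.20 (the resulting counts are not reproduced here):
# Sketch: lower the cut-off from 0.50 to 0.20, trading a higher false
# positive rate for better sensitivity.
prediction_test_20 = factor((predict(model.log, test, type = 'response') >= 0.20)*1)
confusionMatrix(prediction_test_20, factor(test$fraud), positive = '1')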
predicted_probs = predict(model.log, test, type = 'response')
roc_obj <- roc(test$fraud, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_obj) # Output: AUC value
## Area under the curve: 0.959
plot(1-roc_obj$specificities, roc_obj$sensitivities, main = paste0("AUC = ", round(auc(roc_obj), 3)), xlab = 'false positive rate', ylab = 'True positive rate')
abline(a = 0, b = 1, lty = 2, col = "gray")
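pROC can also suggest a threshold directly; for instance, coords() returns the ROC point maximizing Youden’s J statistic (illustrative, not part of the original analysis):
# Illustrative: threshold maximizing sensitivity + specificity - 1.
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))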
Use the train data to fit a Naive Bayes classifier.
model.nb <- naive_bayes(factor(fraud) ~ log_amount + gender + mod_age + mod_category, data = train, usekernel = T, laplace = 10)
summary(model.nb)
##
## ================================= Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = factor(fraud) ~ log_amount + gender + mod_age + mod_category, data = train, laplace = 10, usekernel = T)
## - Laplace: 10
## - Classes: 2
## - Samples: 475672
## - Features: 4
## - Conditional distributions:
## - Categorical: 3
## - KDE: 1
## - Prior probabilities:
## - 0: 0.988
## - 1: 0.012
##
## --------------------------------------------------------------------------------
Predictions on train data.
prediction_train = predict(model.nb, train, type = 'class')
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(prediction_train, factor(train$fraud), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 464864 1503
## 1 5101 4204
##
## Accuracy : 0.9861
## 95% CI : (0.9858, 0.9864)
## No Information Rate : 0.988
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5534
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.736639
## Specificity : 0.989146
## Pos Pred Value : 0.451800
## Neg Pred Value : 0.996777
## Prevalence : 0.011998
## Detection Rate : 0.008838
## Detection Prevalence : 0.019562
## Balanced Accuracy : 0.862893
##
## 'Positive' Class : 1
##
Predictions on test data.
prediction_test = predict(model.nb, test, type = 'class')
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
confusionMatrix(prediction_test, factor(test$fraud), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 116137 415
## 1 1289 1078
##
## Accuracy : 0.9857
## 95% CI : (0.985, 0.9863)
## No Information Rate : 0.9874
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5516
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.722036
## Specificity : 0.989023
## Pos Pred Value : 0.455429
## Neg Pred Value : 0.996439
## Prevalence : 0.012555
## Detection Rate : 0.009065
## Detection Prevalence : 0.019904
## Balanced Accuracy : 0.855530
##
## 'Positive' Class : 1
##
The model has a sensitivity of almost 72%, which means 72% of the fraud cases can be detected using the Naive Bayes algorithm. It also has a specificity of 99%, which is good; specificity indicates how well the model distinguishes the non-fraud cases.
Now let’s take a look at the AUC value and ROC curve.
predicted_probs = predict(model.nb, test, type = 'prob')[,"1"]
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
roc_obj <- roc(test$fraud, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc_obj) # Output: AUC value
## Area under the curve: 0.9771
plot(1-roc_obj$specificities, roc_obj$sensitivities, main = paste0("AUC = ", round(auc(roc_obj), 3)), xlab = 'false positive rate', ylab = 'True positive rate')
abline(a = 0, b = 1, lty = 2, col = "gray")
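To compare the two classifiers visually, one could overlay their ROC curves; a sketch (the roc_log/roc_nb names are ours, and the naive_bayes predict warning will repeat):
# Sketch: overlay the logistic regression and Naive Bayes ROC curves.
roc_log <- roc(test$fraud, predict(model.log, test, type = 'response'))
roc_nb  <- roc(test$fraud, predict(model.nb, test, type = 'prob')[, "1"])
plot(roc_log, col = "steelblue", legacy.axes = TRUE)
lines(roc_nb, col = "firebrick")
legend("bottomright", legend = c("Logistic regression", "Naive Bayes"),
       col = c("steelblue", "firebrick"), lwd = 2)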
Of the models considered, Naive Bayes performs fairly well compared to the other modelling technique, with a sensitivity of 72%. This is decent considering the class imbalance problem we observed.