The dataset used is from https:/paysim1/data. Paysim supplied artificial data for mobile money transactions spanning one month. The dataset comprises 11 attributes and approximately 6.3 million entries. An overview of the variable properties is presented here:
Load libraries and import the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
transactions<-read.csv("fraud.csv", header=TRUE, sep=",")
View the data
head(transactions)
## step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 1 0
## 4 21182 0 1 0
## 5 0 0 0 0
## 6 0 0 0 0
dim(transactions)
## [1] 6362620 11
The dataset contains 11 attributes and 6,362,620 transaction observations.
Let’s view a summary of the data.
summary(transactions)
## step type amount nameOrig
## Min. : 1.0 Length:6362620 Min. : 0 Length:6362620
## 1st Qu.:156.0 Class :character 1st Qu.: 13390 Class :character
## Median :239.0 Mode :character Median : 74872 Mode :character
## Mean :243.4 Mean : 179862
## 3rd Qu.:335.0 3rd Qu.: 208721
## Max. :743.0 Max. :92445517
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. : 0 Min. : 0 Length:6362620 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 Class :character 1st Qu.: 0
## Median : 14208 Median : 0 Mode :character Median : 132706
## Mean : 833883 Mean : 855114 Mean : 1100702
## 3rd Qu.: 107315 3rd Qu.: 144258 3rd Qu.: 943037
## Max. :59585040 Max. :49585040 Max. :356015889
## newbalanceDest isFraud isFlaggedFraud
## Min. : 0 Min. :0.000000 Min. :0.0e+00
## 1st Qu.: 0 1st Qu.:0.000000 1st Qu.:0.0e+00
## Median : 214661 Median :0.000000 Median :0.0e+00
## Mean : 1224996 Mean :0.001291 Mean :2.5e-06
## 3rd Qu.: 1111909 3rd Qu.:0.000000 3rd Qu.:0.0e+00
## Max. :356179279 Max. :1.000000 Max. :1.0e+00
These summary statistics give a quick overview of each column. For the numerical variables (“step”, “amount”, “oldbalanceOrg”, “newbalanceOrig”, “oldbalanceDest”, and “newbalanceDest”), summary() reports measures of central tendency (mean and median) and of spread (minimum, quartiles, and maximum). For the character variables (“type”, “nameOrig”, and “nameDest”), it reports only the length, class, and mode. “isFraud” and “isFlaggedFraud” are binary variables (0 or 1), so their means are the share of rows equal to 1: about 0.13% of transactions are fraudulent, and about 0.00025% are flagged.
Using the isFraud variable, count the fraudulent and non-fraudulent transactions.
fraud_count<- transactions %>% count(isFraud)
print(fraud_count)
## isFraud n
## 1 0 6354407
## 2 1 8213
Let’s visualize this through a frequency plot.
g <- ggplot(transactions, aes(x = factor(isFraud)))
g + geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(scales::percent(count / sum(count), accuracy = 0.1))))
The dataset exhibits a significant imbalance, with fraudulent transactions accounting for only about 0.13% of the total data. To mitigate potential model bias, it’s advisable to explore sampling techniques, such as undersampling, as a preliminary step before moving on to the modeling phase.
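The size of the imbalance can be checked with a few lines of arithmetic. The sketch below uses only the two class counts printed by count(isFraud) above; the variable names are illustrative.

```r
# Class counts as printed by count(isFraud) above
fraud_counts <- c(not_fraud = 6354407, fraud = 8213)

# Share of each class, in percent
class_share <- 100 * fraud_counts / sum(fraud_counts)
print(round(class_share, 3))

# Number of non-fraud rows for every fraud row
imbalance_ratio <- unname(fraud_counts["not_fraud"] / fraud_counts["fraud"])
print(round(imbalance_ratio))
```

A ratio of several hundred to one is why a model can reach high accuracy while learning very little about the fraud class.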
ggplot(data = transactions, aes(x = type, fill = as.factor(isFraud))) + geom_bar() + labs(title = 'Frequency of Transaction Type', subtitle = 'Fraud vs Not Fraud', x = 'Transaction Type', y = 'No of transactions') + theme_classic()
The plot above offers little insight into fraud transactions because they are so infrequent. Let’s create a plot of transaction types for the fraud cases only.
Plot the number of fraud transactions for each transaction type
Fraud_transaction_type <- transactions %>% group_by(type) %>% summarise(fraud_transactions = sum(isFraud))
ggplot(data = Fraud_transaction_type, aes(x = type, y = fraud_transactions)) + geom_col(aes(fill = type), show.legend = FALSE) + labs(title = 'Fraud transactions per type', x = 'Transaction type', y = 'No of Fraud Transactions') + geom_label(aes(label = fraud_transactions)) + theme_classic()
As observed in the plot above, fraud transactions are exclusively associated with the “CASH_OUT” and “TRANSFER” transaction types. This observation will be significant in our subsequent analysis, as we can streamline our assessment by concentrating solely on these two transaction types.
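Counts per type can be complemented with a fraud rate per type. The following is a minimal base R sketch on a small made-up data frame (the values in `toy` are hypothetical and not taken from fraud.csv); on the real data the same calls would be applied to `transactions`.

```r
# Toy data (hypothetical values, not taken from fraud.csv)
toy <- data.frame(
  type    = c("CASH_OUT", "CASH_OUT", "TRANSFER", "TRANSFER",
              "PAYMENT", "PAYMENT", "DEBIT", "CASH_IN"),
  isFraud = c(1, 0, 1, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Number of fraud rows and fraud rate within each transaction type
fraud_n    <- tapply(toy$isFraud, toy$type, sum)
fraud_rate <- tapply(toy$isFraud, toy$type, mean)
print(cbind(fraud_n, fraud_rate))

# Types that contain at least one fraud case
fraud_types <- names(fraud_n)[fraud_n > 0]
print(fraud_types)
```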
ggplot(data = transactions[transactions$isFraud==1,], aes(x = amount)) + geom_histogram(bins = 30, fill = 'steelblue') + labs(title = 'Fraud transaction Amount distribution', y = 'No. of Fraud transactions', x = 'Amount in Dollars')
The distribution of transaction amounts for fraudulent cases exhibits a pronounced right skew: most fraudulent transactions involve comparatively small amounts, with a long tail of much larger ones.
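A right skew can also be confirmed numerically rather than by eye. This sketch uses a hypothetical vector of amounts (not values from the dataset) to show the usual signs: a mean far above the median, and quantiles bunched at the low end.

```r
# Hypothetical fraud amounts: many small values and a few very large ones
amounts <- c(120, 250, 400, 800, 1500, 3000, 9000, 50000, 400000, 10000000)

# In a right-skewed distribution the mean sits well above the median
print(c(mean = mean(amounts), median = median(amounts)))

# Quantiles show where the bulk of the values lies
print(quantile(amounts, probs = c(0.25, 0.5, 0.75, 0.95)))

# A log transform compresses the long tail, which makes such data easier to plot
print(round(log1p(amounts), 2))
```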
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceOrg), fill = factor(isFraud))) +
geom_boxplot(show.legend = FALSE) +
labs(title = 'Old Balance in Sender Accounts', x = 'isFraud', y = 'log(1 + Balance Amount)') +
theme_classic()
p2 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceDest), fill = factor(isFraud))) +
geom_boxplot(show.legend = FALSE) +
labs(title = 'Old Balance in Receiver Accounts', x = 'isFraud', y = 'log(1 + Balance Amount)') +
theme_classic()
grid.arrange(p1, p2, nrow = 1)
For most fraudulent transactions, the old balance of the origin account
(where the payment originates) tends to be higher than it is for
non-fraudulent transactions, while the old balance of the destination
account tends to be lower.
# Convert step to hour of day (0 to 23), treating each step as one hour
transactions$hour <- transactions$step %% 24
#Plot newly formatted data
p5<- ggplot(data = transactions, aes(x = hour)) + geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +labs(title= 'Total transactions at different Hours', y = 'No. of transactions') + theme_classic()
p6<-ggplot(data = transactions[transactions$isFraud==1,], aes(x = hour)) + geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +labs(title= 'Fraud transactions at different Hours', y = 'No. of fraud transactions') + theme_classic()
grid.arrange(p5, p6, ncol = 1, nrow = 2)
The overall transaction count during hours 0 to 9 is notably
low. However, this pattern doesn’t hold for fraudulent
transactions, which do not show the same drop. Fraud therefore makes up
a larger share of all transactions from midnight to 9 am.
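The comparison between the two bar charts comes down to the fraud share per period of the day. Below is a small sketch of that calculation on made-up rows (the `toy` values and the `early` column are hypothetical); it also shows how `%%` maps steps onto hours 0 to 23.

```r
# Hypothetical steps and labels (not taken from fraud.csv)
toy <- data.frame(
  step    = c(1, 2, 3, 25, 26, 12, 13, 14, 36, 37, 38, 60),
  isFraud = c(1, 0, 0, 1,  0,  0,  0,  0,  1,  0,  0,  0)
)

# Steps 1, 25, 49, ... all map to hour 1; step 24 maps to hour 0
toy$hour  <- toy$step %% 24
toy$early <- toy$hour <= 9   # TRUE for hours 0 to 9

# Row count and fraud share in each period
n_by_period     <- tapply(toy$isFraud, toy$early, length)
share_by_period <- tapply(toy$isFraud, toy$early, mean)
print(rbind(n = n_by_period, fraud_share = share_by_period))
```

Note that hour 0 corresponds to midnight only if the simulation starts at midnight, which is an assumption here.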
To summarize the exploration so far: the target classes are heavily imbalanced, so sampling techniques such as undersampling are worth exploring to mitigate model bias. Fraud occurs only in CASH_OUT and TRANSFER transactions, so the analysis can focus on these two types. Fraud amounts are right-skewed, with most fraudulent transactions at the lower end of the range. Fraud makes up a larger share of transactions during the early hours, from midnight to 9 am.
Check for transactions where the amount exceeds the available balance in the origin account while the destination balance still increases.
head(transactions[(transactions$amount > transactions$oldbalanceOrg)& (transactions$newbalanceDest > transactions$oldbalanceDest), c("amount","oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud")], 10)
## amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest
## 11 9644.94 4465.00 0 10845 157982.12
## 16 229133.94 15325.00 0 5083 51513.44
## 25 311685.89 10835.00 0 6267 2719172.89
## 49 5346.89 0.00 0 652637 6453430.91
## 73 94253.33 25203.05 0 99773 965870.05
## 82 78766.03 0.00 0 103772 277515.05
## 84 125872.53 0.00 0 348512 3420103.09
## 85 379856.23 0.00 0 900180 19169204.93
## 86 1505626.01 0.00 0 29031 5515763.34
## 89 761507.39 0.00 0 1280036 19169204.93
## isFraud
## 11 0
## 16 0
## 25 0
## 49 0
## 73 0
## 82 0
## 84 0
## 85 0
## 86 0
## 89 0
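To go beyond the first ten rows, the same condition can be stored as a logical column and tabulated against isFraud. The sketch below uses six rows copied from the outputs printed above (rows 1, 3, 4, 11, 16 and 49); `exceeds_balance` is a hypothetical column name, and on the real data the comparison would run on `transactions`.

```r
# Six rows copied from the head() outputs above
toy <- data.frame(
  amount        = c(9839.64, 181.00, 181.00, 9644.94, 229133.94, 5346.89),
  oldbalanceOrg = c(170136,  181,    181,    4465.00, 15325.00,  0.00),
  isFraud       = c(0, 1, 1, 0, 0, 0)
)

# TRUE when the transaction amount is larger than the origin balance
toy$exceeds_balance <- toy$amount > toy$oldbalanceOrg

# Cross-tabulate the flag against the fraud label
print(table(isFraud = toy$isFraud, exceeds_balance = toy$exceeds_balance))

# Share of flagged rows within each class
print(tapply(toy$exceeds_balance, toy$isFraud, mean))
```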
I will filter the data by transaction type to include only CASH_OUT and TRANSFER. Additionally, we can exclude the columns “nameOrig” and “nameDest” due to the excessive number of unique levels, which would make it impractical to create dummy variables for them. Moreover, we can safely remove the “step” column since it was utilized to derive the “hour” attribute.
# Filter transactions and drop irrelevant features
transactions1<- transactions %>%
select(-any_of(c('step','nameOrig', 'nameDest', 'isFlaggedFraud'))) %>%
filter(type %in% c('CASH_OUT','TRANSFER'))
After this step only CASH_OUT and TRANSFER rows remain, since every fraudulent transaction falls within these two types. The “isFlaggedFraud” column is dropped as well: it equals 1 for only about 0.00025% of rows (a mean of 2.5e-06 in the summary above), so it adds little here.
library(fastDummies)
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
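# Create 0/1 dummy columns for the character column 'type'
transactions1 <- dummy_cols(transactions1)
# Store the target as a factor so that the models treat this as classification
transactions1$isFraud <- as.factor(transactions1$isFraud)
# Drop the original 'type' column (column 1), now replaced by the dummies
transactions1 <- transactions1[,-1]
transactions1<-as.data.frame(transactions1)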
#summarizeColumns(transactions1)
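For readers unfamiliar with dummy encoding, here is a rough base R equivalent of the step above on a three-row toy data frame. It is a sketch of the idea, not of how fastDummies is implemented; the `type_<level>` naming is chosen to match the column names that appear later in the importance table.

```r
# Toy data frame with one categorical column
toy <- data.frame(type   = c("CASH_OUT", "TRANSFER", "CASH_OUT"),
                  amount = c(181, 181, 229133.94),
                  stringsAsFactors = FALSE)

# One 0/1 column per level of 'type'
for (lvl in sort(unique(toy$type))) {
  toy[[paste0("type_", lvl)]] <- as.integer(toy$type == lvl)
}

# Remove the original character column
toy$type <- NULL
print(toy)
```

With only two levels, the two dummy columns are exact complements of each other. Tree-based models tolerate this redundancy, but a linear model would need one of them dropped.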
We know that fraudulent transactions are infrequent. Fraud accounts for only about 0.13% of the full dataset (about 0.3% after filtering to CASH_OUT and TRANSFER), so balancing the data by duplicating fraud cases (oversampling) would mean repeating each fraud case hundreds of times. A more sensible strategy is to reduce the number of non-fraud cases through undersampling.
set.seed(12345)
train_id <- sample(seq_len(nrow(transactions1)), size = floor(0.7*nrow(transactions1)))
train <- transactions1[train_id,]
valid <- transactions1[-train_id,]
table(train$isFraud)
##
## 0 1
## 1933579 5707
table(valid$isFraud)
##
## 0 1
## 828617 2506
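The split above is a simple random sample, so the fraud share in the two parts can differ slightly by chance. One alternative, sketched here in base R on hypothetical labels, is a stratified split that samples 70% within each class. This is an illustration, not what the analysis above does.

```r
set.seed(12345)

# Hypothetical labels: 1,000 rows with 2% fraud
labels <- factor(c(rep(0, 980), rep(1, 20)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(labels), labels)
train_id <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Both parts keep the same 2% fraud share
print(prop.table(table(labels[train_id])))
print(prop.table(table(labels[-train_id])))
```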
suppressMessages(library(ROSE))
set.seed(12345)
prop.table(table(train$isFraud))
##
## 0 1
## 0.997057164 0.002942836
library(caret)
## Loading required package: lattice
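# Separate the predictors from the target by name rather than by column position
inputs <- train[, names(train) != "isFraud"]
target <- train$isFraud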
# Downsample the data
down_train <- downSample(x = inputs, y = target)
# Calculate the proportions
proportions <- prop.table(table(down_train$Class))
# Print the proportions
print(proportions)
##
## 0 1
## 0.5 0.5
# To see it as percentages:
percentages <- prop.table(table(down_train$Class)) * 100
print(percentages)
##
## 0 1
## 50 50
By applying undersampling, we achieve a balanced training dataset consisting of 5,707 observations for each class. Nevertheless, it’s important to note that our validation dataset will still maintain its imbalanced nature.
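Conceptually, undersampling keeps every minority-class row and a random subset of the majority class of the same size. The base R sketch below illustrates this on hypothetical data; it is not caret's implementation, and `toy` and `down_toy` are illustrative names.

```r
set.seed(12345)

# Hypothetical imbalanced data: 980 non-fraud rows and 20 fraud rows
toy <- data.frame(amount  = runif(1000, 1, 1e5),
                  isFraud = factor(c(rep(0, 980), rep(1, 20))))

# Size of the smallest class
n_minority <- min(table(toy$isFraud))

# Draw n_minority row indices from each class without replacement
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$isFraud), function(idx) {
  idx[sample.int(length(idx), size = n_minority)]
}), use.names = FALSE)

down_toy <- toy[keep, ]
print(table(down_toy$isFraud))
```

The cost of this approach is that most majority-class rows are discarded, so the model sees only a small sample of legitimate behavior.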
I will explore tree-based modeling approaches like Decision Trees and Random Forests for predicting fraudulent transactions.
Using the undersampled data, load the rpart package for decision trees and rpart.plot for visualizing them.
library(rpart)
library(rpart.plot) # for visualizing the decision tree
Build the decision tree model using the balanced data ‘down_train’, with ‘Class’ as the dependent variable.
model_dt <- rpart(Class ~ ., data = down_train)
# Plot the decision tree
prp(model_dt) # prp function is from the rpart.plot package
predict_dt <- predict(model_dt, valid, type = "class")
head(predict_dt)
## 2 6 7 10 13 14
## 1 1 0 0 0 0
## Levels: 0 1
confusionMatrix(valid$isFraud,predict_dt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 775545 53072
## 1 120 2386
##
## Accuracy : 0.936
## 95% CI : (0.9355, 0.9365)
## No Information Rate : 0.9333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.077
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.99985
## Specificity : 0.04302
## Pos Pred Value : 0.93595
## Neg Pred Value : 0.95211
## Prevalence : 0.93327
## Detection Rate : 0.93313
## Detection Prevalence : 0.99698
## Balanced Accuracy : 0.52143
##
## 'Positive' Class : 0
##
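The accuracy of the model is 93.6%.
Calculate the F1 score
The F1 score is the harmonic mean of Precision and Recall. We can compute it from the output above because Recall is synonymous with Sensitivity and Precision corresponds to Positive Predictive Value. Two details of that output matter when reading these numbers. First, confusionMatrix() takes the predictions as its first argument and the actual labels as its second; the call above passes them the other way round, so the rows labelled “Prediction” hold the actual classes and the columns labelled “Reference” hold the predictions. Second, the ‘Positive’ class is 0, so Sensitivity and Pos Pred Value describe the non-fraud class.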
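# Recall = Sensitivity and Precision = Pos Pred Value, both reported for class 0 (non-fraud)
Recall_1<-0.99985
Precision_1<-0.93595
F1<-2*(Recall_1*Precision_1)/(Recall_1+Precision_1)
F1*100
## [1] 96.68453
The F1 score for the decision tree model is 96.68%, but this is the score for the non-fraud class, which is the ‘Positive’ class in the output above and makes up over 99% of the validation set. For the fraud class, the same confusion matrix shows that the tree catches 2,386 of the 2,506 fraudulent transactions (recall of about 95.2%), while only 2,386 of the 55,458 transactions it labels as fraud are actually fraudulent (precision of about 4.3%), giving a fraud-class F1 score of about 8.2%.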
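Metrics for the fraud class can be computed directly from the four counts in the decision tree confusion matrix. In that printed table the rows hold the actual classes (their totals, 828,617 and 2,506, match table(valid$isFraud)), so the sketch below reads the counts accordingly; the variable names are illustrative.

```r
# Counts from the decision tree confusion matrix, with fraud (1) as the positive class
tp <- 2386     # actual fraud, predicted fraud
fn <- 120      # actual fraud, predicted non-fraud
fp <- 53072    # actual non-fraud, predicted fraud
tn <- 775545   # actual non-fraud, predicted non-fraud

recall_fraud    <- tp / (tp + fn)   # share of fraud cases that were caught
precision_fraud <- tp / (tp + fp)   # share of fraud alerts that were correct
f1_fraud <- 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

print(round(100 * c(recall = recall_fraud,
                    precision = precision_fraud,
                    f1 = f1_fraud), 2))
```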
Create a Random Forest model with default parameters
Create a Random Forest model with default parameters using the down_train data frame
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf1 <- randomForest(Class ~ ., data = down_train, importance = TRUE)
#Print the model output to see the results
print(rf1)
##
## Call:
## randomForest(formula = Class ~ ., data = down_train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.04%
## Confusion matrix:
## 0 1 class.error
## 0 5613 94 0.01647100
## 1 139 5568 0.02435605
rf2 <- randomForest(Class ~ ., data = down_train, ntree = 500, mtry = 6, importance = TRUE)
rf2
##
## Call:
## randomForest(formula = Class ~ ., data = down_train, ntree = 500, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1%
## Confusion matrix:
## 0 1 class.error
## 0 5626 81 0.014193096
## 1 33 5674 0.005782373
The OOB estimate of the error rate has improved from 2.04% in the first model to 1%. Now, let’s explore one more random forest scenario. In this model, we reduce the number of trees in the forest from 500 to 200 but keep the same number of predictors tried at each node split.
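The OOB error rate is the share of off-diagonal entries in the printed OOB confusion matrix, which makes it easy to verify by hand. The sketch below rebuilds the matrices for the first two models from the outputs above; `oob_error` is a hypothetical helper, not part of randomForest.

```r
# Share of misclassified observations in a square confusion matrix
oob_error <- function(conf) {
  (sum(conf) - sum(diag(conf))) / sum(conf)
}

# OOB confusion matrices copied from print(rf1) and rf2 (rows are actual classes)
rf1_conf <- matrix(c(5613,   94,
                      139, 5568), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
rf2_conf <- matrix(c(5626,   81,
                       33, 5674), nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

print(round(100 * c(rf1 = oob_error(rf1_conf), rf2 = oob_error(rf2_conf)), 2))
```

Keep in mind that these rates are measured on the balanced, downsampled data, so they do not carry over directly to the imbalanced validation set.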
rf3 <- randomForest(Class ~ ., data = down_train, ntree = 200, mtry = 6, importance = TRUE)
rf3
##
## Call:
## randomForest(formula = Class ~ ., data = down_train, ntree = 200, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.03%
## Confusion matrix:
## 0 1 class.error
## 0 5625 82 0.014368320
## 1 35 5672 0.006132819
The estimated error rate in this scenario is 1.03%, slightly higher than in model 2, so we will use model 2 (rf2) to evaluate our validation dataset. First, we assess the predictive performance of the model on the full, imbalanced training dataset.
# Predicting on train set
predTrain <- predict(rf2, train, type = "class")
# Checking classification accuracy
table(predTrain, train$isFraud)
##
## predTrain 0 1
## 0 1908967 0
## 1 24612 5707
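On the training dataset the model catches all 5,707 fraud cases and labels 24,612 of the 1,933,579 non-fraud transactions as fraud. Because the fraud cases and the downsampled non-fraud cases were used to fit the model, this is an optimistic estimate. Now we’ll assess its effectiveness on the validation dataset.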
# Predicting on Validation set
predValid <- predict(rf2, valid, type = "class")
# Checking classification accuracy
mean(predValid == valid$isFraud)
## [1] 0.9870537
table(predValid,valid$isFraud)
##
## predValid 0 1
## 0 817876 19
## 1 10741 2487
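The random forest model reaches an accuracy of 98.7% on the validation set. Because only about 0.3% of validation transactions are fraudulent, accuracy alone says little about how well fraud is detected. Now, let’s compute the F1 score for this model.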
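To put the accuracy figure in context, it helps to compare it with a trivial baseline that labels every transaction as non-fraud. The sketch below uses only the four counts from table(predValid, valid$isFraud) above; the variable names are illustrative.

```r
# Validation counts from table(predValid, valid$isFraud) above
correct <- 817876 + 2487                 # diagonal of the table
total   <- 817876 + 19 + 10741 + 2487    # all validation rows
model_accuracy <- correct / total

# A baseline that always predicts non-fraud is right on every non-fraud row
n_non_fraud <- 817876 + 10741
baseline_accuracy <- n_non_fraud / total

print(round(100 * c(model = model_accuracy, baseline = baseline_accuracy), 2))
```

The baseline scores higher on accuracy while detecting no fraud at all, which is why precision, recall and F1 are the more useful measures for this problem.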
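# Counts from the table above, with fraud (1) as the positive class
TP=2487
TN=817876
FP=10741
FN=19
#Calculate Recall
Recall=(TP/(TP+FN))
Recall*100
## [1] 99.24182
# Calculate Precision
Precision=(TP/(TP+FP))
# Calculate F1
F1_rf<-2*(Recall*Precision)/(Recall+Precision)
F1_rf*100
## [1] 31.61307
The random forest model has a recall of 99.24% and an F1 score of 31.61% for the fraud class. The F1 score is held down by precision: of the 13,228 transactions predicted as fraud, 2,487 (about 18.8%) are actually fraudulent.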
Check which variables are important
importance(rf2)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## amount 45.885340 50.87388 61.26865 799.77750
## oldbalanceOrg 147.979430 80.20494 135.48595 3050.98588
## newbalanceOrig 189.878658 95.96116 188.50676 555.60244
## oldbalanceDest 50.807008 35.69202 55.76129 141.44153
## newbalanceDest 27.532772 47.85634 51.65584 913.59359
## hour 36.462088 21.92172 39.19226 159.80673
## type_CASH_OUT 3.136108 18.41215 18.50820 40.03161
## type_TRANSFER 3.576017 19.08424 19.13036 45.27567
varImpPlot(rf2)
The plots above provide insights into the variable importance in our random forest model, ranking them from most to least significant when making predictions. This aids in addressing the interpretability concerns associated with random forest models.
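The two importance measures do not rank the variables identically, which is easier to see in a sorted table than in the plot. The sketch below copies two columns from the importance(rf2) output above into a data frame and orders them; `imp`, `by_accuracy` and `by_gini` are illustrative names.

```r
# Values copied from the importance(rf2) output above
imp <- data.frame(
  variable = c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest",
               "newbalanceDest", "hour", "type_CASH_OUT", "type_TRANSFER"),
  MeanDecreaseAccuracy = c(61.26865, 135.48595, 188.50676, 55.76129,
                           51.65584, 39.19226, 18.50820, 19.13036),
  MeanDecreaseGini = c(799.77750, 3050.98588, 555.60244, 141.44153,
                       913.59359, 159.80673, 40.03161, 45.27567),
  stringsAsFactors = FALSE
)

# Variables ordered from most to least important under each measure
by_accuracy <- imp$variable[order(imp$MeanDecreaseAccuracy, decreasing = TRUE)]
by_gini     <- imp$variable[order(imp$MeanDecreaseGini, decreasing = TRUE)]

print(data.frame(rank = seq_along(by_accuracy), by_accuracy, by_gini))
```

Under both measures the balance and amount variables dominate, while the two transaction type dummies rank last.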
1.3 Model Conclusions
On the validation set, with fraud as the positive class, our decision tree model has: accuracy of 93.6%, recall of 95.21%, precision of 4.30%, and F1 score of 8.23%.
Our random forest model has: accuracy of 98.7%, recall of 99.24%, precision of 18.80%, and F1 score of 31.61%.
Typically, in this situation, the F1 score is used to determine the preferred model since it offers a balanced assessment of Precision and Recall, and the random forest scores higher on it. We will nevertheless treat Recall as our main scoring metric. The reason for choosing Recall over accuracy is the higher cost associated with False Negatives: the company faces greater consequences from failing to detect a fraudulent transaction. The random forest misses 19 of the 2,506 fraudulent transactions in the validation set, compared with 120 for the decision tree, and it also raises fewer false alarms (10,741 versus 53,072). Consequently, the company should opt for the random forest model for predicting fraud.
1.4 Conclusion
Paysim provided simulated data for fraudulent mobile money transactions. The objective of this project was to achieve accurate predictions of fraudulent transactions using tree-based machine learning algorithms. Surprisingly, both models surpassed expectations, possibly owing to the characteristics of the simulated data.
1. Explore alternative methods for fine-tuning model parameters. Additionally, consider employing clustering techniques to enhance the performance of the random forest model.
2. Investigate other machine learning techniques such as XGBoost and Support Vector Machine algorithms, which may offer valuable insights and improve predictive capabilities.
3. Delve deeper into why fraudulent transactions occur only within the TRANSFER and CASH_OUT transaction types, to gain a more comprehensive understanding of the data’s underlying patterns and characteristics.