Detecting Online Payment Fraud Using Machine Learning Models

Lecturer: Dr. Ang Group 9

Matric Full Name
23121328 Mohammed Iqram
24052516 LI JUNMING
22106713 LI YUEXIN
23108677 ZHAO ZITONG
23111676 LIU YICONG

Introduction

Fraud detection is an essential field of study due to the increasing prevalence of online transactions and digital payment systems. With billions of dollars lost annually to fraudulent activities, detecting and preventing fraud has become a top priority for organizations to protect financial resources and maintain customer trust. Fraudulent transactions often constitute a small portion of all activities, making detection a challenging task due to data imbalance and the constantly evolving nature of fraudulent patterns.

This study focuses on understanding the key features that contribute to the likelihood of fraud and developing a predictive model to assign risk scores to transactions. By identifying significant features and predicting the likelihood of fraud, businesses can prioritize high-risk cases for investigation, reducing false positives and improving detection efficiency.

Questions:

  1. Can accurately classify a transaction as fraudulent or non-fraudulent using machine learning models? (Classification)
  2. What features contribute most significantly to the likelihood of fraud, and can predict the risk score? (Regression)

Objectives:

  1. To accurately classify a transaction as fraudulent or non-fraudulent using machine learning models.
  2. To determine features contribute most significantly to the likelihood of fraud.
# Load necessary library
library(readr)
## Warning: 程序包'readr'是用R版本4.4.2 来建造的
library(dplyr)
## Warning: 程序包'dplyr'是用R版本4.4.2 来建造的
## 
## 载入程序包:'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(fastDummies)
## Warning: 程序包'fastDummies'是用R版本4.4.2 来建造的
library(digest)
## Warning: 程序包'digest'是用R版本4.4.2 来建造的
library(tidyr)
## Warning: 程序包'tidyr'是用R版本4.4.2 来建造的
library(ggplot2)
## Warning: 程序包'ggplot2'是用R版本4.4.2 来建造的
library(scales)
## Warning: 程序包'scales'是用R版本4.4.2 来建造的
## 
## 载入程序包:'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(caret)
## Warning: 程序包'caret'是用R版本4.4.2 来建造的
## 载入需要的程序包:lattice
library(tidyverse)   
## Warning: 程序包'tidyverse'是用R版本4.4.2 来建造的
## Warning: 程序包'tibble'是用R版本4.4.2 来建造的
## Warning: 程序包'purrr'是用R版本4.4.2 来建造的
## Warning: 程序包'stringr'是用R版本4.4.2 来建造的
## Warning: 程序包'forcats'是用R版本4.4.2 来建造的
## Warning: 程序包'lubridate'是用R版本4.4.2 来建造的
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ purrr::discard()     masks scales::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ purrr::lift()        masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest) 
## Warning: 程序包'randomForest'是用R版本4.4.2 来建造的
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## 载入程序包:'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(pROC)
## Warning: 程序包'pROC'是用R版本4.4.2 来建造的
## Type 'citation("pROC")' for a citation.
## 
## 载入程序包:'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(corrplot)
## Warning: 程序包'corrplot'是用R版本4.4.2 来建造的
## corrplot 0.95 loaded
library(reshape2)
## Warning: 程序包'reshape2'是用R版本4.4.2 来建造的
## 
## 载入程序包:'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

1.0 The Details of Raw data

rawdata <- read.csv('dirty_data.csv')
head(rawdata) # check first six rows of rawdata
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1              0              0       0              0
## 2              0              0       0              0
## 3              0              0       1              0
## 4          21182              0       1              0
## 5              0              0       0              0
## 6              0              0       0              0

Basic information about the dataset

Title: onlinefraud.csv

Year:2022

Purpose of dataset:Online paymentf Fraud detection by machine learning

Dimension:
Number of rows: 1,048,575
Number of columns: 11

Data size: 493.53 MB

Meaning of Column Names:

Column Name Description
step Represents a unit of time where 1 step equals 1 hour
type Type of online transaction
amount The amount of the transaction
nameOrig Customer starting the transaction
oldbalanceOrg Balance before the transaction
newbalanceOrig Balance after the transaction
nameDest Recipient of the transaction
oldbalanceDest Initial balance of recipient before the transaction
newbalanceDest New balance of recipient after the transaction
isFraud Indicates whether the transaction is fraudulent (1 = Fraud)
# Check the dimension of the dataset
dim(rawdata)
## [1] 1048575      11
# Structure of the dataset
str(rawdata)
## 'data.frame':    1048575 obs. of  11 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ isFlaggedFraud: int  0 0 0 0 0 0 0 0 0 0 ...
# Summary of the dataset
summary(rawdata)
##       step           type               amount           nameOrig        
##  Min.   : 1.00   Length:1048575     Min.   :       0   Length:1048575    
##  1st Qu.:15.00   Class :character   1st Qu.:   12149   Class :character  
##  Median :20.00   Mode  :character   Median :   76343   Mode  :character  
##  Mean   :26.97                      Mean   :  158667                     
##  3rd Qu.:39.00                      3rd Qu.:  213762                     
##  Max.   :95.00                      Max.   :10000000                     
##                                     NA's   :6                            
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##                                                                             
##  newbalanceDest        isFraud         isFlaggedFraud
##  Min.   :       0   Min.   :0.000000   Min.   :0     
##  1st Qu.:       0   1st Qu.:0.000000   1st Qu.:0     
##  Median :  218260   Median :0.000000   Median :0     
##  Mean   : 1114198   Mean   :0.001089   Mean   :0     
##  3rd Qu.: 1149808   3rd Qu.:0.000000   3rd Qu.:0     
##  Max.   :42200000   Max.   :1.000000   Max.   :0     
## 

2.0 Data Cleaning

data <- rawdata # for data cleaning purpose

2.1 Drop no meaning column

data <- rawdata %>% select(-"isFlaggedFraud")  # df <- df %>% select(-B)
head(data)
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud
## 1              0              0       0
## 2              0              0       0
## 3              0              0       1
## 4          21182              0       1
## 5              0              0       0
## 6              0              0       0

2.2 Check for Null Values

colSums(is.na(data))
##           step           type         amount       nameOrig  oldbalanceOrg 
##              0              0              6              0              0 
## newbalanceOrig       nameDest oldbalanceDest newbalanceDest        isFraud 
##              0              0              0              0              0

It shows that the Amount Column has 6 Null values

2.3 Fill the Blank Values

data$amount[is.na(data$amount)] <-
  data$oldbalanceOrg[is.na(data$amount)] - data$newbalanceOrig[is.na(data$amount)]

The blanks are filled logically by ( amount = oldbalanceOrg - newbalanceOrig)

sum(is.na(data))
## [1] 0

There are no blank values

2.4 Check for Duplicates

sum(duplicated(data))
## [1] 0

No duplicates found

2.5 Data Type Validation

str(data)
## 'data.frame':    1048575 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
data$type <- as.factor(data$type)
data$isFraud <- as.logical(data$isFraud)
data$amount <- as.numeric(data$amount)
data$oldbalanceOrg <- as.numeric(data$oldbalanceOrg)
data$newbalanceOrig <- as.numeric(data$newbalanceOrig)
data$oldbalanceDest <- as.numeric(data$oldbalanceDest)
data$newbalanceDest <- as.numeric(data$newbalanceDest)
summary(data)
##       step             type            amount           nameOrig        
##  Min.   : 1.00   CASH_IN :227130   Min.   : -227335   Length:1048575    
##  1st Qu.:15.00   CASH_OUT:373641   1st Qu.:   12149   Class :character  
##  Median :20.00   DEBIT   :  7178   Median :   76342   Mode  :character  
##  Mean   :26.97   PAYMENT :353873   Mean   :  158666                     
##  3rd Qu.:39.00   TRANSFER: 86753   3rd Qu.:  213762                     
##  Max.   :95.00                     Max.   :10000000                     
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##  newbalanceDest      isFraud       
##  Min.   :       0   Mode :logical  
##  1st Qu.:       0   FALSE:1047433  
##  Median :  218260   TRUE :1142     
##  Mean   : 1114198                  
##  3rd Qu.: 1149808                  
##  Max.   :42200000

2.6 Check of Outlier

boxplot(data$amount, main = "Boxplot for Amount", ylab = "Amount")

boxplot(data$oldbalanceOrg, main = "Boxplot for Old Balance Origin", ylab = "Old Balance Origin")

boxplot(data$newbalanceOrig, main = "Boxplot for New Balance Origin", ylab = "New Balance Origin")

boxplot(data$oldbalanceDest, main = "Boxplot for Old Balance Des", ylab = "Old Balance Des")

boxplot(data$newbalanceDest, main = "Boxplot for New Balance Des", ylab = "New Balance Des")

I am not removing data outliers as they represent valid financial transactions and removing them would lead to a loss of valuable information and potentially biased results.

2.7 Ensure Balance Consistency

# Identify inconsistent rows
inconsistent_rows <- data[data$newbalanceOrig > data$oldbalanceOrg |
                          data$newbalanceDest > data$oldbalanceDest, ]
cat("Inconsistent rows:", nrow(inconsistent_rows), "\n")
## Inconsistent rows: 661442
# Keep only consistent rows
data <- data[data$newbalanceOrig <= data$oldbalanceOrg &
             data$newbalanceDest <= data$oldbalanceDest, ]
cat(all(data$newbalanceOrig <= data$oldbalanceOrg), "\n")
## TRUE
cat(all(data$newbalanceDest <= data$oldbalanceDest), "\n")
## TRUE
str(data)
## 'data.frame':    387133 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...
# Keep only consistent rows:38,7133

2.8 Categorical Data Validation

unique(data$type)
## [1] PAYMENT  TRANSFER CASH_OUT DEBIT    CASH_IN 
## Levels: CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
# unique(data$nameOrig)
# unique(data$nameDest)

Drop rows with type:“CASH_IN”, which is unexpected type to detect fraud

data <- data[data$type != "CASH_IN", ]
cat("Frequency table for 'type':\n")
## Frequency table for 'type':
print(table(data$type))
## 
##  CASH_IN CASH_OUT    DEBIT  PAYMENT TRANSFER 
##        0    24731     1036   353873     3126
str(data)
## 'data.frame':    382766 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...

after droped “CASH_IN”: 38,7133 observations >> 38,2766 observations

2.9 One-Hot Encoding for categorical variables(type)

data <- dummy_cols(
  data,
  select_columns = c("type"),
  remove_selected_columns = TRUE
)

Hashing Encoding for categorical variables (nameOrig, nameDest) to convert them into numeric format.

data$nameOrig <- as.integer(sapply(data$nameOrig, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000 #num_buckets could change based on model
}))
data$nameDest <- as.integer(sapply(data$nameDest, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000
}))

2.10 Feature Engineering

# Convert steps into days and create bins for different time periods
data <- data %>%
  mutate(day = step %/% 24 + 1 )%>%
  mutate(period = case_when(
    step %% 24 >= 0 & step %% 24 < 6 ~ "Night",
    step %% 24 >= 6 & step %% 24 < 12 ~ "Morning",
    step %% 24 >= 12 & step %% 24 < 18 ~ "Afternoon",
    TRUE ~ "Evening"
  ))

# Calculate ratio of transaction amount to the old balance and balance change
# For the originating account
data <- data %>%
  mutate(ratio_orig = ifelse(oldbalanceOrg > 0, amount / oldbalanceOrg, 0))%>%
  mutate(change_orig = oldbalanceOrg - newbalanceOrig)

# For the destination account
data <- data %>%
  mutate(ratio_dest = ifelse(oldbalanceDest > 0, amount / oldbalanceDest, 0)) %>%
  mutate(change_dest = oldbalanceDest - newbalanceDest)

# Difference between origin and destination balance change
data <- data %>%
  mutate(change_diff = change_orig - change_dest)

3.0 Exploratory Data Analysis(EDA)

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

3.1 The distribution of isFraud

Plot of Transaction Types by Fraud Status

options(scipen = 999)
plot_data <- data %>%
  group_by(isFraud) %>%
  summarise(
    CASH_OUT = sum(type_CASH_OUT),
    DEBIT = sum(type_DEBIT),
    PAYMENT = sum(type_PAYMENT),
    TRANSFER = sum(type_TRANSFER)
  )


plot_data_long <- plot_data %>%
  pivot_longer(
    cols = c(CASH_OUT, DEBIT, PAYMENT, TRANSFER),
    names_to = "Transaction_Type",
    values_to = "Value"
  )


plot_data_long$isFraud <- as.factor(plot_data_long$isFraud)


ggplot(plot_data_long, aes(x = Transaction_Type, y = Value, fill = isFraud)) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8), color = "black") +
  geom_text(aes(label = Value),
            vjust = -0.5,
            size = 3,
            position = position_dodge(width = 0.8),
            color = "black") +
  scale_y_continuous(trans = "identity") +
  labs(
    title = "Count Plot of Transaction Types by Fraud Status",
    x = "Transaction Type",
    y = "Number of Transactions",
    fill = "Fraud Status"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "right"
  ) +
  scale_fill_manual(
    values = c("FALSE" = "skyblue", "TRUE" = "#FF6666"),
    labels = c("Non-Fraudulent", "Fraudulent")
  )

3.2 Numerical Feature Distribution

Histograms for numerical columns

cols_of_interest <- c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")

data_filtered <- data[cols_of_interest]

for (col_name in cols_of_interest) {

  log_col_name <- paste0("Log_", col_name)
  data_filtered[[log_col_name]] <- log(data_filtered[[col_name]] + 1)

  p <- ggplot(data_filtered, aes(x = .data[[log_col_name]])) +
    geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
    labs(title = paste("Histogram of Log-transformed", col_name),
         x = paste("Log(", col_name, " + 1)", sep = ""),
         y = "Frequency") +
    theme_minimal()

  print(p)
}

3.3 Correlation Analysis

num_cols <- sapply(data, is.numeric)
cor_matrix <- cor(data[, num_cols], use = "complete.obs")
## Warning in cor(data[, num_cols], use = "complete.obs"): 标准差为零
cor_matrix <- cor_matrix[rownames(cor_matrix) != "type_CASH_IN", colnames(cor_matrix) != "type_CASH_IN"]

dist_matrix <- as.dist(1 - cor_matrix)

if (any(is.na(dist_matrix))) {
    stop("Distance matrix contains NA values.")
}
if (any(dist_matrix < 0, na.rm = TRUE)) {
    stop("Distance matrix contains negative values.")
}

hclust_order <- hclust(dist_matrix)$order

cor_matrix <- cor_matrix[hclust_order, hclust_order]

library(corrplot)
corrplot(
    cor_matrix,
    method = "circle",
    type = "full",
    tl.cex = 0.8,
    cl.cex = 0.8,
    tl.col = "black",
    addrect = 2
)

3.4 Balance Analysis

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

Scatter Plots

Investigate consistency of balances. First scatter plot help to understand the relationship between the sender’s original and new balances after transactions, also helps detect unusual or unexpected patterns, like abrupt zero balances or discrepancies. Second scatter plot aims to study how transactions impact the recipient’s account balance. And helps spot anomalies where balances remain unchanged (potential red flags for fraud or simulation.

# Scatter plot for oldbalanceOrg vs newbalanceOrig
plotorg <- ggplot(data, aes(x = oldbalanceOrg, y = newbalanceOrig)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Scatter Plot: oldbalanceOrg vs newbalanceOrig",
       x = "Old Balance Origin",
       y = "New Balance Origin") +
  theme_minimal()

print(plotorg)

# Scatter plot for oldbalanceDest vs newbalanceDest
plotdest <- ggplot(data, aes(x = oldbalanceDest, y = newbalanceDest)) +
  geom_point(alpha = 0.5, color = "lightgreen") +
  labs(title = "Scatter Plot: oldbalanceDest vs newbalanceDest",
       x = "Initial balance of recipient before the transaction",
       y = "New balance of recipient after the transaction") +
  theme_minimal()

# Print the second plot
print(plotdest)

3.5 Fraud Pattern Detection

To investigate fraud patterns by comparing amounts and balance changes for fraudulent and non-fraudulent transactions.

# Violin plot for Transaction Amount by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = amount, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  scale_y_log10() + # Use log scale for better visualization if amounts vary widely
  labs(title = "Transaction Amount Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Transaction Amount (log scale)") +
  theme_minimal() +
  theme(legend.position = "none")

The distribution of transaction amounts for both fraudulent and non-fraudulent transactions. The shape of the violin indicates the density (frequency) of transactions at various amount levels. Besides, log scale is used to better visualize the wide range of transaction amounts. If the fraudulent distribution is skewed towards higher amounts compared to non-fraudulent transactions, it suggests fraud often involves larger sums of money. And peaks (wider sections) indicate common transaction amounts. Other side, narrow tails in the non-fraudulent category might indicate a consistent range of typical transaction amounts, while broader tails for fraud might highlight its unpredictability.

# Violin plot for Balance Changes (oldbalanceOrg - newbalanceOrig) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = oldbalanceOrg - newbalanceOrig, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Origin Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (oldbalanceOrg - newbalanceOrig)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with most balance changes concentrated near small values (close to 0). This indicates that for non-fraudulent transactions, the sender’s balance changes are typically modest and predictable. The balance change pattern is consistent with legitimate transactions, where the deducted amount aligns with normal activities.

  • Fraudulent Transactions (TRUE)

    The distribution is much broader, and balance changes can be extremely large, reaching values in the millions. This suggests that fraudulent transactions often involve significant deductions from the sender’s account. The sharp contrast in balance change magnitude highlights that fraud frequently involves outliers (large, irregular transactions).

# Violin plot for Balance Changes (newbalanceDest - oldbalanceDest) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = newbalanceDest - oldbalanceDest, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Destination Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (newbalanceDest - oldbalanceDest)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with balance changes clustered near zero. This indicates that in most legitimate transactions, recipients’ balances either increase modestly or stay consistent with expected transaction behavior. The low variability suggests predictable and consistent crediting in legitimate transactions.

  • Fraudulent Transactions (TRUE)

    The distribution for fraudulent transactions also exhibits a narrow range, but there are significant outliers. These outliers indicate: Large negative balance changes, where credits do not align with the deducted amounts from the sender. Inconsistent or unusual behavior, possibly due to incomplete or simulated transactions.

4.0 Modeling

4.1 Classification Model: random forest

# Split data sets
data_rf <- data
data_rf$isFraud <- as.factor(data_rf$isFraud)
levels(data_rf$isFraud) <- c("NonFraud", "Fraud") 
set.seed(123)
train_index <- createDataPartition(data_rf$isFraud, p = 0.7, list = FALSE)
train_data <- data_rf[train_index, ]
test_data <- data_rf[-train_index, ]
#Building model 


n_features <- ncol(data_rf) - 1
# Setting up cross-validation controls for random searches
train_control_random <- trainControl(
  method = "cv",        
  number = 5,           
  classProbs = TRUE,    
  summaryFunction = twoClassSummary 
)
# Random search
set.seed(123)
rf_random_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneLength = 10, 
  trControl = train_control_random,
  ntree = 100     
)

print(rf_random_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9999890  0.9999925  0.9417722
##    4    0.9999973  0.9999888  0.9544304
##    6    0.9999956  0.9999888  0.9518987
##    8    0.9999950  0.9999850  0.9518987
##   10    0.9999939  0.9999888  0.9518987
##   13    0.9999717  0.9999888  0.9518987
##   15    0.9949039  0.9999888  0.9518987
##   17    0.9949044  0.9999888  0.9518987
##   19    0.9873092  0.9999888  0.9518987
##   22    0.9885763  0.9999888  0.9518987
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
best_mtry <- rf_random_search$bestTune$mtry
cat("Best mtry from Random Search:", best_mtry, "\n")
## Best mtry from Random Search: 4

Due to the large amount of data, we first used random search to narrow down the tuning parameters. A grid search was then used to determine the exact ground parameters. This reduces runtime and improves efficiency.

# Grid Search
# Narrow down mtry based on random search results
rf_grid <- expand.grid(
  mtry = seq(max(1, best_mtry - 2), best_mtry + 2, by = 1) #Make sure mtry is not less than 1
)

# Setting up cross-validation controls for grid searches
train_control_grid <- trainControl(
  method = "cv",        # cross-validation
  number = 5,           # 5-fold 
  classProbs = TRUE,    
  summaryFunction = twoClassSummary # Use of ROC as an assessment indicator
)

# Grid Search
set.seed(123)
rf_grid_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneGrid = rf_grid,
  trControl = train_control_grid,
  ntree = 100
)


print(rf_grid_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   2     0.9999898  0.9999925  0.9518987
##   3     0.9999943  0.9999888  0.9518987
##   4     0.9999965  0.9999888  0.9518987
##   5     0.9999958  0.9999888  0.9544304
##   6     0.9999974  0.9999850  0.9569620
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
cat("Best mtry from Grid Search:", rf_grid_search$bestTune$mtry, "\n")
## Best mtry from Grid Search: 6
# Model Evaluation
# Evaluate the final model on the test set
final_model <- rf_grid_search$finalModel
test_predictions <- predict(rf_grid_search, newdata = test_data, type = "prob")[, 2]

# Calculate AUC
test_roc <- roc(test_data$isFraud, test_predictions)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
cat("Test AUC:", auc(test_roc), "\n")
## Test AUC: 0.9970317
# Plotting ROC curves
plot(test_roc, main = "ROC Curve for Final Model")

As can be seen from the provided ROC curve graphs, the model’s classification performance is very strong. The curve is close to the upper left corner, indicating that the model performs well in distinguishing between positive (fraudulent) and negative (non-fraudulent) samples with high Sensitivity and Specificity. In addition, the area under the ROC curve (AUC) is expected to be close to 1, which further validates that the model’s classification ability is almost optimal under different thresholds. However, this near-perfect performance may require vigilance against the possibility of overfitting, especially if the test data is too similar to the training data distribution or there is a data leakage problem. A combination of confusion matrices, classification metrics (e.g., precision and recall), and further examination of the data distribution is needed to ensure the robustness and generalisation ability of the model.

# Confusion Matrix
test_predictions_class <- predict(rf_grid_search, newdata = test_data, type = "raw")
confusion_matrix <- confusionMatrix(test_predictions_class, test_data$isFraud)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NonFraud  Fraud
##   NonFraud   114659      2
##   Fraud           1    167
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.9999, 1)        
##     No Information Rate : 0.9985             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9911             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.9882             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9940             
##              Prevalence : 0.9985             
##          Detection Rate : 0.9985             
##    Detection Prevalence : 0.9985             
##       Balanced Accuracy : 0.9941             
##                                              
##        'Positive' Class : NonFraud           
## 

From the confusion matrix and statistical metrics, the Random Forest model performed extremely well in the classification task. The confusion matrix shows that the model had very few misclassifications in the test set, with 2 predictions in the NonFraud category being misclassified as Fraud and only 1 prediction in the Fraud category being misclassified as NonFraud. the overall Accuracy of the model is 1, with an almost perfect 95% confidence interval (0.9999, 1). The Sensitivity of 1 indicates that the model correctly detects all positive samples (NonFraud), while the Specificity of 0.9882 indicates that the model is also very accurate in classifying negative samples (Fraud). The Positive Pred Value (Pos Pred Value) of 1 indicates that all samples predicted as NonFraud are true NonFraud, while the Negative Pred Value (Neg Pred Value) of 0.994 indicates that most of the samples predicted as Fraud are also accurate.

4.2 Regression Model

data_logit <- data
data_logit$isFraud <- as.factor(data_logit$isFraud)
levels(data_logit$isFraud) <- c("NonFraud", "Fraud")
set.seed(123)
train_index <- createDataPartition(data_logit$isFraud, p = 0.7, list = FALSE)
train_data <- data_logit[train_index, ]
test_data <- data_logit[-train_index, ]
logit_model <- glm(isFraud ~ ., data = train_data, family = binomial, control = list(maxit = 50))
## Warning: glm.fit:拟合概率算出来是数值零或一
summary(logit_model)
## 
## Call:
## glm(formula = isFraud ~ ., family = binomial, data = train_data, 
##     control = list(maxit = 50))
## 
## Coefficients: (4 not defined because of singularities)
##                                      Estimate                     Std. Error
## (Intercept)    -3042866434641853837006482.000          46921597419489696.000
## step                        4110946821902.652                      84405.344
## amount                        -1732453559.117                          3.742
## nameOrig                      12567222501.885                        448.652
## oldbalanceOrg                  3305655614.222                          3.872
## newbalanceOrig                -6655561949.567                          3.954
## nameDest                      -5066725870.390                        449.323
## oldbalanceDest                -1631068099.140                          2.150
## newbalanceDest                 1453817041.490                          2.149
## type_CASH_IN                               NA                             NA
## type_CASH_OUT   3042866434306248313600224.000          46921597419361032.000
## type_DEBIT      3042866433682500413866662.000          46921597419363176.000
## type_PAYMENT    3042866434390436115200408.000          46921597419363704.000
## type_TRANSFER   3042866434561473524068842.000          46921597419325040.000
## day                       -64363358160846.922                    2066016.713
## periodEvening              16153548717839.422                     546064.555
## periodMorning              46468472513322.094                     533547.603
## periodNight               134108480308588.109                    1425457.084
## ratio_orig                    -4340634001.569                        177.209
## change_orig                                NA                             NA
## ratio_dest                  -111963329231.272                        774.756
## change_dest                                NA                             NA
## change_diff                                NA                             NA
##                    z value            Pr(>|z|)    
## (Intercept)      -64850018 <0.0000000000000002 ***
## step              48704817 <0.0000000000000002 ***
## amount          -462967637 <0.0000000000000002 ***
## nameOrig          28011071 <0.0000000000000002 ***
## oldbalanceOrg    853748967 <0.0000000000000002 ***
## newbalanceOrig -1683415101 <0.0000000000000002 ***
## nameDest         -11276365 <0.0000000000000002 ***
## oldbalanceDest  -758741709 <0.0000000000000002 ***
## newbalanceDest   676479802 <0.0000000000000002 ***
## type_CASH_IN            NA                  NA    
## type_CASH_OUT     64850018 <0.0000000000000002 ***
## type_DEBIT        64850018 <0.0000000000000002 ***
## type_PAYMENT      64850018 <0.0000000000000002 ***
## type_TRANSFER     64850018 <0.0000000000000002 ***
## day              -31153358 <0.0000000000000002 ***
## periodEvening     29581757 <0.0000000000000002 ***
## periodMorning     87093396 <0.0000000000000002 ***
## periodNight       94081037 <0.0000000000000002 ***
## ratio_orig       -24494500 <0.0000000000000002 ***
## change_orig             NA                  NA    
## ratio_dest      -144514367 <0.0000000000000002 ***
## change_dest             NA                  NA    
## change_diff             NA                  NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5939.9  on 267936  degrees of freedom
## Residual deviance: 6487.9  on 267918  degrees of freedom
## AIC: 6525.9
## 
## Number of Fisher Scoring iterations: 27
test_data$predicted_prob <- predict(logit_model, newdata = test_data, type = "response")

test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, "Fraud", "NonFraud")
test_data$predicted_class <- factor(test_data$predicted_class, levels = levels(data_logit$isFraud))

conf_matrix <- confusionMatrix(
  test_data$predicted_class,
  test_data$isFraud
)  

conf_matrix_table <- as.table(conf_matrix)
conf_matrix_df <- as.data.frame(conf_matrix_table)
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
misclassification_rate <- 1 - accuracy
specificity <- conf_matrix$byClass['Specificity']

metrics_df <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Misclassification Rate", "Specificity"),
  Value = c(accuracy, precision, recall, f1_score, misclassification_rate, specificity)
)

print(metrics_df)
##                   Metric        Value
## 1               Accuracy 0.9997561592
## 2              Precision 0.9998866193
## 3                 Recall 0.9998691784
## 4               F1 Score 0.9998778988
## 5 Misclassification Rate 0.0002438408
## 6            Specificity 0.9230769231
mcnemar_test <- mcnemar.test(conf_matrix$table)
cat("McNemar's Test P-Value:", mcnemar_test$p.value, "\n")
## McNemar's Test P-Value: 0.8501067
roc_curve <- roc(test_data$isFraud, test_data$predicted_prob)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.9614731
plot(roc_curve, main = "ROC Curve", col = "blue")

feature_importance <- summary(logit_model)$coefficients

importance_df <- data.frame(
  Feature = rownames(feature_importance),
  Coefficient = feature_importance[, 1],
  PValue = feature_importance[, 4]
)

importance_df <- importance_df[!grepl("Intercept", importance_df$Feature), ]
importance_df <- na.omit(importance_df)

feature_summary_df <- importance_df %>%
  mutate(Var1 = ifelse(Coefficient > 0, "Positive", "Negative")) %>%
  count(Var1) %>%
  rename(Freq = n) %>%
  mutate(Percentage = Freq / sum(Freq) * 100)

ggplot(feature_summary_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Distribution of Feature Coefficients (Positive vs Negative)", x = NULL, y = NULL) +
  theme_void() +
  scale_fill_manual(values = c("skyblue", "lightcoral")) +
  geom_text(aes(label = paste(Var1, "\n", round(Percentage, 1), "%")), position = position_stack(vjust = 0.5), size = 5, color = "white") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(importance_df, aes(x = reorder(Feature, Coefficient), y = Coefficient, fill = Coefficient > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("skyblue", "lightcoral")) +
  labs(title = "Feature Importance (Coefficients) from Logistic Regression",
       x = "Features",
       y = "Coefficient Value") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))

test_data$risk_score <- test_data$predicted_prob
test_data$risk_category <- ifelse(test_data$predicted_prob > 0.7, "High Risk",
                                 ifelse(test_data$predicted_prob > 0.3, "Medium Risk", "Low Risk"))

head(test_data[, c("predicted_prob", "risk_score", "risk_category")])
##              predicted_prob               risk_score risk_category
## 1  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 2  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 5  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 16 0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 19 0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 30 0.0000000000000002220446 0.0000000000000002220446      Low Risk
ggplot(test_data, aes(x = risk_category)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Distribution of Risk Categories", x = "Risk Category", y = "Count") +
  theme_minimal()

ggplot(conf_matrix_df, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "blue", name = "Frequency") +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  labs(
    title = "Confusion Matrix Visualization",
    x = "Predicted Class",
    y = "Actual Class"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

5.0 Result Discussion

5.1 Classification Model: random forest

The model demonstrated exceptional performance, achieving an overall accuracy of 1 with a sensitivity of 1.0 and specificity of 0.9882, as reflected in the confusion matrix. These results indicate that the model correctly identified almost all non-fraudulent and fraudulent transactions, with only three misclassifications in total. The positive predictive value (precision) of 1.0 for the non-fraudulent class further demonstrates the model’s reliability in avoiding false positives, while a Kappa score of 0.9911 confirms a near-perfect agreement between predictions and actual labels. These metrics strongly suggest that the Random Forest model not only meets but exceeds the objective, making it a highly effective tool for fraud detection.

Despite its strong performance, several considerations need to be addressed to validate the model’s robustness. Fraudulent transactions are typically rare, raising concerns about potential overfitting or reliance on majority class patterns, even with strong metrics. Additionally, the almost perfect results may indicate a potential similarity between the training and testing datasets, or even a risk of data leakage, which should be ruled out by further testing on external datasets. To ensure broader applicability, the model should also be tested on more imbalanced and diverse datasets to assess its generalization capability. Beyond its predictive accuracy, the Random Forest model provides interpretability through feature importance analysis, offering actionable insights into the key drivers of fraud. These insights can be invaluable for organizations to strengthen fraud prevention strategies and optimize their operations.

5.2 Regression Model

In regression model, accuracy stands at an impressive 99.98%, demonstrating that the model is highly accurate and correctly predicts outcomes for nearly all cases. This near-perfect accuracy suggests robust performance across various scenarios. Besides, precision reaches an extraordinary 99.99%, indicating exceptional reliability when the model predicts a positive class, such as in fraud detection. This means that when the model identifies a positive instance, it is almost always correct, minimizing false positive errors. More than that, the recall metric is equally remarkable at 99.99%, showcasing the model’s effectiveness in identifying almost all actual positive cases. This high recall ensures that very few genuine positive instances are missed, which is crucial in sensitive applications like fraud detection or medical diagnostics. And the F1 score of 99.99% further validates the model’s outstanding performance by balancing precision and recall. This metric confirms that the model maintains exceptional performance across both identifying positive cases and maintaining high accuracy. Therefore, the misclassification rate is remarkably low at just 0.02%, signifying that the model makes very few errors in its predictions. This minimal error rate underscores the model’s reliability and precision in classification tasks. While most metrics are near-perfect, the specificity is slightly lower at 92.31%. This indicates that the model is somewhat less effective in correctly identifying negative cases, such as non-fraudulent events. Despite this relatively lower specificity, the overall performance remains exceptionally strong.

6.0 Conclusion

The Random Forest model has successfully addressed the objective of accurately classifying fraudulent and non-fraudulent transactions. Its high accuracy, precision, and recall demonstrate its potential as a reliable fraud detection tool. However, further validation on unseen datasets and in real-world scenarios is recommended to ensure its robustness and long-term effectiveness. In the regression model, the feature importantce shows the transfer, payment, cash out and debt contribute significantly to the likelihood of fraud.