Detecting Online Payment Fraud Using Machine Learning Models
Lecturer: Dr. Ang Group 9
| Matric | Full Name |
|---|---|
| 23121328 | Mohammed Iqram |
| 24052516 | LI JUNMING |
| 22106713 | LI YUEXIN |
| 23108677 | ZHAO ZITONG |
| 23111676 | LIU YICONG |
Fraud detection is an essential field of study due to the increasing prevalence of online transactions and digital payment systems. With billions of dollars lost annually to fraudulent activities, detecting and preventing fraud has become a top priority for organizations to protect financial resources and maintain customer trust. Fraudulent transactions often constitute a small portion of all activities, making detection a challenging task due to data imbalance and the constantly evolving nature of fraudulent patterns.
This study focuses on understanding the key features that contribute to the likelihood of fraud and developing a predictive model to assign risk scores to transactions. By identifying significant features and predicting the likelihood of fraud, businesses can prioritize high-risk cases for investigation, reducing false positives and improving detection efficiency.
# Load necessary library
library(readr)
## Warning: 程序包'readr'是用R版本4.4.2 来建造的
library(dplyr)
## Warning: 程序包'dplyr'是用R版本4.4.2 来建造的
##
## 载入程序包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(fastDummies)
## Warning: 程序包'fastDummies'是用R版本4.4.2 来建造的
library(digest)
## Warning: 程序包'digest'是用R版本4.4.2 来建造的
library(tidyr)
## Warning: 程序包'tidyr'是用R版本4.4.2 来建造的
library(ggplot2)
## Warning: 程序包'ggplot2'是用R版本4.4.2 来建造的
library(scales)
## Warning: 程序包'scales'是用R版本4.4.2 来建造的
##
## 载入程序包:'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(caret)
## Warning: 程序包'caret'是用R版本4.4.2 来建造的
## 载入需要的程序包:lattice
library(tidyverse)
## Warning: 程序包'tidyverse'是用R版本4.4.2 来建造的
## Warning: 程序包'tibble'是用R版本4.4.2 来建造的
## Warning: 程序包'purrr'是用R版本4.4.2 来建造的
## Warning: 程序包'stringr'是用R版本4.4.2 来建造的
## Warning: 程序包'forcats'是用R版本4.4.2 来建造的
## Warning: 程序包'lubridate'是用R版本4.4.2 来建造的
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## Warning: 程序包'randomForest'是用R版本4.4.2 来建造的
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## 载入程序包:'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(pROC)
## Warning: 程序包'pROC'是用R版本4.4.2 来建造的
## Type 'citation("pROC")' for a citation.
##
## 载入程序包:'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(corrplot)
## Warning: 程序包'corrplot'是用R版本4.4.2 来建造的
## corrplot 0.95 loaded
library(reshape2)
## Warning: 程序包'reshape2'是用R版本4.4.2 来建造的
##
## 载入程序包:'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
rawdata <- read.csv('dirty_data.csv')
head(rawdata) # check first six rows of rawdata
## step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 1 0
## 4 21182 0 1 0
## 5 0 0 0 0
## 6 0 0 0 0
Basic information about the dataset
Title: onlinefraud.csv
Year:2022
Purpose of dataset:Online paymentf Fraud detection by machine learning
Dimension:
Number of rows: 1,048,575
Number of columns: 11
Data size: 493.53 MB
Meaning of Column Names:
| Column Name | Description |
|---|---|
| step | Represents a unit of time where 1 step equals 1 hour |
| type | Type of online transaction |
| amount | The amount of the transaction |
| nameOrig | Customer starting the transaction |
| oldbalanceOrg | Balance before the transaction |
| newbalanceOrig | Balance after the transaction |
| nameDest | Recipient of the transaction |
| oldbalanceDest | Initial balance of recipient before the transaction |
| newbalanceDest | New balance of recipient after the transaction |
| isFraud | Indicates whether the transaction is fraudulent (1 = Fraud) |
# Check the dimension of the dataset
dim(rawdata)
## [1] 1048575 11
# Structure of the dataset
str(rawdata)
## 'data.frame': 1048575 obs. of 11 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : chr "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
# Summary of the dataset
summary(rawdata)
## step type amount nameOrig
## Min. : 1.00 Length:1048575 Min. : 0 Length:1048575
## 1st Qu.:15.00 Class :character 1st Qu.: 12149 Class :character
## Median :20.00 Mode :character Median : 76343 Mode :character
## Mean :26.97 Mean : 158667
## 3rd Qu.:39.00 3rd Qu.: 213762
## Max. :95.00 Max. :10000000
## NA's :6
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. : 0 Min. : 0 Length:1048575 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 Class :character 1st Qu.: 0
## Median : 16002 Median : 0 Mode :character Median : 126377
## Mean : 874010 Mean : 893809 Mean : 978160
## 3rd Qu.: 136642 3rd Qu.: 174600 3rd Qu.: 915923
## Max. :38900000 Max. :38900000 Max. :42100000
##
## newbalanceDest isFraud isFlaggedFraud
## Min. : 0 Min. :0.000000 Min. :0
## 1st Qu.: 0 1st Qu.:0.000000 1st Qu.:0
## Median : 218260 Median :0.000000 Median :0
## Mean : 1114198 Mean :0.001089 Mean :0
## 3rd Qu.: 1149808 3rd Qu.:0.000000 3rd Qu.:0
## Max. :42200000 Max. :1.000000 Max. :0
##
data <- rawdata # for data cleaning purpose
data <- rawdata %>% select(-"isFlaggedFraud") # df <- df %>% select(-B)
head(data)
## step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## oldbalanceDest newbalanceDest isFraud
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 21182 0 1
## 5 0 0 0
## 6 0 0 0
colSums(is.na(data))
## step type amount nameOrig oldbalanceOrg
## 0 0 6 0 0
## newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud
## 0 0 0 0 0
It shows that the Amount Column has 6 Null values
data$amount[is.na(data$amount)] <-
data$oldbalanceOrg[is.na(data$amount)] - data$newbalanceOrig[is.na(data$amount)]
The blanks are filled logically by ( amount = oldbalanceOrg - newbalanceOrig)
sum(is.na(data))
## [1] 0
There are no blank values
sum(duplicated(data))
## [1] 0
No duplicates found
str(data)
## 'data.frame': 1048575 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : chr "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
data$type <- as.factor(data$type)
data$isFraud <- as.logical(data$isFraud)
data$amount <- as.numeric(data$amount)
data$oldbalanceOrg <- as.numeric(data$oldbalanceOrg)
data$newbalanceOrig <- as.numeric(data$newbalanceOrig)
data$oldbalanceDest <- as.numeric(data$oldbalanceDest)
data$newbalanceDest <- as.numeric(data$newbalanceDest)
summary(data)
## step type amount nameOrig
## Min. : 1.00 CASH_IN :227130 Min. : -227335 Length:1048575
## 1st Qu.:15.00 CASH_OUT:373641 1st Qu.: 12149 Class :character
## Median :20.00 DEBIT : 7178 Median : 76342 Mode :character
## Mean :26.97 PAYMENT :353873 Mean : 158666
## 3rd Qu.:39.00 TRANSFER: 86753 3rd Qu.: 213762
## Max. :95.00 Max. :10000000
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. : 0 Min. : 0 Length:1048575 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 Class :character 1st Qu.: 0
## Median : 16002 Median : 0 Mode :character Median : 126377
## Mean : 874010 Mean : 893809 Mean : 978160
## 3rd Qu.: 136642 3rd Qu.: 174600 3rd Qu.: 915923
## Max. :38900000 Max. :38900000 Max. :42100000
## newbalanceDest isFraud
## Min. : 0 Mode :logical
## 1st Qu.: 0 FALSE:1047433
## Median : 218260 TRUE :1142
## Mean : 1114198
## 3rd Qu.: 1149808
## Max. :42200000
boxplot(data$amount, main = "Boxplot for Amount", ylab = "Amount")
boxplot(data$oldbalanceOrg, main = "Boxplot for Old Balance Origin", ylab = "Old Balance Origin")
boxplot(data$newbalanceOrig, main = "Boxplot for New Balance Origin", ylab = "New Balance Origin")
boxplot(data$oldbalanceDest, main = "Boxplot for Old Balance Des", ylab = "Old Balance Des")
boxplot(data$newbalanceDest, main = "Boxplot for New Balance Des", ylab = "New Balance Des")
I am not removing data outliers as they represent valid financial transactions and removing them would lead to a loss of valuable information and potentially biased results.
# Identify inconsistent rows
inconsistent_rows <- data[data$newbalanceOrig > data$oldbalanceOrg |
data$newbalanceDest > data$oldbalanceDest, ]
cat("Inconsistent rows:", nrow(inconsistent_rows), "\n")
## Inconsistent rows: 661442
# Keep only consistent rows
data <- data[data$newbalanceOrig <= data$oldbalanceOrg &
data$newbalanceDest <= data$oldbalanceDest, ]
cat(all(data$newbalanceOrig <= data$oldbalanceOrg), "\n")
## TRUE
cat(all(data$newbalanceDest <= data$oldbalanceDest), "\n")
## TRUE
str(data)
## 'data.frame': 387133 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : logi FALSE FALSE TRUE TRUE FALSE FALSE ...
# Keep only consistent rows:38,7133
unique(data$type)
## [1] PAYMENT TRANSFER CASH_OUT DEBIT CASH_IN
## Levels: CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
# unique(data$nameOrig)
# unique(data$nameDest)
Drop rows with type:“CASH_IN”, which is unexpected type to detect fraud
data <- data[data$type != "CASH_IN", ]
cat("Frequency table for 'type':\n")
## Frequency table for 'type':
print(table(data$type))
##
## CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
## 0 24731 1036 353873 3126
str(data)
## 'data.frame': 382766 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : logi FALSE FALSE TRUE TRUE FALSE FALSE ...
after droped “CASH_IN”: 38,7133 observations >> 38,2766 observations
data <- dummy_cols(
data,
select_columns = c("type"),
remove_selected_columns = TRUE
)
Hashing Encoding for categorical variables (nameOrig, nameDest) to convert them into numeric format.
data$nameOrig <- as.integer(sapply(data$nameOrig, function(x) {
hash_value <- digest(x, algo = "xxhash32", seed = 123)
as.numeric(paste0("0x", hash_value)) %% 1000 #num_buckets could change based on model
}))
data$nameDest <- as.integer(sapply(data$nameDest, function(x) {
hash_value <- digest(x, algo = "xxhash32", seed = 123)
as.numeric(paste0("0x", hash_value)) %% 1000
}))
# Convert steps into days and create bins for different time periods
data <- data %>%
mutate(day = step %/% 24 + 1 )%>%
mutate(period = case_when(
step %% 24 >= 0 & step %% 24 < 6 ~ "Night",
step %% 24 >= 6 & step %% 24 < 12 ~ "Morning",
step %% 24 >= 12 & step %% 24 < 18 ~ "Afternoon",
TRUE ~ "Evening"
))
# Calculate ratio of transaction amount to the old balance and balance change
# For the originating account
data <- data %>%
mutate(ratio_orig = ifelse(oldbalanceOrg > 0, amount / oldbalanceOrg, 0))%>%
mutate(change_orig = oldbalanceOrg - newbalanceOrig)
# For the destination account
data <- data %>%
mutate(ratio_dest = ifelse(oldbalanceDest > 0, amount / oldbalanceDest, 0)) %>%
mutate(change_dest = oldbalanceDest - newbalanceDest)
# Difference between origin and destination balance change
data <- data %>%
mutate(change_diff = change_orig - change_dest)
head(data)
## step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1 1 9839.64 498 170136 160296.36 563 0
## 2 1 1864.28 727 21249 19384.72 613 0
## 3 1 181.00 293 181 0.00 124 0
## 4 1 181.00 977 181 0.00 92 21182
## 5 1 11668.14 202 41554 29885.86 10 0
## 6 1 7817.71 259 53860 46042.29 721 0
## newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1 0 FALSE 0 0 0 1
## 2 0 FALSE 0 0 0 1
## 3 0 TRUE 0 0 0 0
## 4 0 TRUE 0 1 0 0
## 5 0 FALSE 0 0 0 1
## 6 0 FALSE 0 0 0 1
## type_TRANSFER day period ratio_orig change_orig ratio_dest change_dest
## 1 0 1 Night 0.05783397 9839.64 0.000000000 0
## 2 0 1 Night 0.08773495 1864.28 0.000000000 0
## 3 1 1 Night 1.00000000 181.00 0.000000000 0
## 4 0 1 Night 1.00000000 181.00 0.008544991 21182
## 5 0 1 Night 0.28079463 11668.14 0.000000000 0
## 6 0 1 Night 0.14514872 7817.71 0.000000000 0
## change_diff
## 1 9839.64
## 2 1864.28
## 3 181.00
## 4 -21001.00
## 5 11668.14
## 6 7817.71
options(scipen = 999)
plot_data <- data %>%
group_by(isFraud) %>%
summarise(
CASH_OUT = sum(type_CASH_OUT),
DEBIT = sum(type_DEBIT),
PAYMENT = sum(type_PAYMENT),
TRANSFER = sum(type_TRANSFER)
)
plot_data_long <- plot_data %>%
pivot_longer(
cols = c(CASH_OUT, DEBIT, PAYMENT, TRANSFER),
names_to = "Transaction_Type",
values_to = "Value"
)
plot_data_long$isFraud <- as.factor(plot_data_long$isFraud)
ggplot(plot_data_long, aes(x = Transaction_Type, y = Value, fill = isFraud)) +
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8), color = "black") +
geom_text(aes(label = Value),
vjust = -0.5,
size = 3,
position = position_dodge(width = 0.8),
color = "black") +
scale_y_continuous(trans = "identity") +
labs(
title = "Count Plot of Transaction Types by Fraud Status",
x = "Transaction Type",
y = "Number of Transactions",
fill = "Fraud Status"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "right"
) +
scale_fill_manual(
values = c("FALSE" = "skyblue", "TRUE" = "#FF6666"),
labels = c("Non-Fraudulent", "Fraudulent")
)
cols_of_interest <- c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")
data_filtered <- data[cols_of_interest]
for (col_name in cols_of_interest) {
log_col_name <- paste0("Log_", col_name)
data_filtered[[log_col_name]] <- log(data_filtered[[col_name]] + 1)
p <- ggplot(data_filtered, aes(x = .data[[log_col_name]])) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of Log-transformed", col_name),
x = paste("Log(", col_name, " + 1)", sep = ""),
y = "Frequency") +
theme_minimal()
print(p)
}
num_cols <- sapply(data, is.numeric)
cor_matrix <- cor(data[, num_cols], use = "complete.obs")
## Warning in cor(data[, num_cols], use = "complete.obs"): 标准差为零
cor_matrix <- cor_matrix[rownames(cor_matrix) != "type_CASH_IN", colnames(cor_matrix) != "type_CASH_IN"]
dist_matrix <- as.dist(1 - cor_matrix)
if (any(is.na(dist_matrix))) {
stop("Distance matrix contains NA values.")
}
if (any(dist_matrix < 0, na.rm = TRUE)) {
stop("Distance matrix contains negative values.")
}
hclust_order <- hclust(dist_matrix)$order
cor_matrix <- cor_matrix[hclust_order, hclust_order]
library(corrplot)
corrplot(
cor_matrix,
method = "circle",
type = "full",
tl.cex = 0.8,
cl.cex = 0.8,
tl.col = "black",
addrect = 2
)
head(data)
## step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1 1 9839.64 498 170136 160296.36 563 0
## 2 1 1864.28 727 21249 19384.72 613 0
## 3 1 181.00 293 181 0.00 124 0
## 4 1 181.00 977 181 0.00 92 21182
## 5 1 11668.14 202 41554 29885.86 10 0
## 6 1 7817.71 259 53860 46042.29 721 0
## newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1 0 FALSE 0 0 0 1
## 2 0 FALSE 0 0 0 1
## 3 0 TRUE 0 0 0 0
## 4 0 TRUE 0 1 0 0
## 5 0 FALSE 0 0 0 1
## 6 0 FALSE 0 0 0 1
## type_TRANSFER day period ratio_orig change_orig ratio_dest change_dest
## 1 0 1 Night 0.05783397 9839.64 0.000000000 0
## 2 0 1 Night 0.08773495 1864.28 0.000000000 0
## 3 1 1 Night 1.00000000 181.00 0.000000000 0
## 4 0 1 Night 1.00000000 181.00 0.008544991 21182
## 5 0 1 Night 0.28079463 11668.14 0.000000000 0
## 6 0 1 Night 0.14514872 7817.71 0.000000000 0
## change_diff
## 1 9839.64
## 2 1864.28
## 3 181.00
## 4 -21001.00
## 5 11668.14
## 6 7817.71
Investigate consistency of balances. First scatter plot help to understand the relationship between the sender’s original and new balances after transactions, also helps detect unusual or unexpected patterns, like abrupt zero balances or discrepancies. Second scatter plot aims to study how transactions impact the recipient’s account balance. And helps spot anomalies where balances remain unchanged (potential red flags for fraud or simulation.
# Scatter plot for oldbalanceOrg vs newbalanceOrig
plotorg <- ggplot(data, aes(x = oldbalanceOrg, y = newbalanceOrig)) +
geom_point(alpha = 0.5, color = "blue") +
labs(title = "Scatter Plot: oldbalanceOrg vs newbalanceOrig",
x = "Old Balance Origin",
y = "New Balance Origin") +
theme_minimal()
print(plotorg)
# Scatter plot for oldbalanceDest vs newbalanceDest
plotdest <- ggplot(data, aes(x = oldbalanceDest, y = newbalanceDest)) +
geom_point(alpha = 0.5, color = "lightgreen") +
labs(title = "Scatter Plot: oldbalanceDest vs newbalanceDest",
x = "Initial balance of recipient before the transaction",
y = "New balance of recipient after the transaction") +
theme_minimal()
# Print the second plot
print(plotdest)
To investigate fraud patterns by comparing amounts and balance changes for fraudulent and non-fraudulent transactions.
# Violin plot for Transaction Amount by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = amount, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
scale_y_log10() + # Use log scale for better visualization if amounts vary widely
labs(title = "Transaction Amount Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Transaction Amount (log scale)") +
theme_minimal() +
theme(legend.position = "none")
The distribution of transaction amounts for both fraudulent and non-fraudulent transactions. The shape of the violin indicates the density (frequency) of transactions at various amount levels. Besides, log scale is used to better visualize the wide range of transaction amounts. If the fraudulent distribution is skewed towards higher amounts compared to non-fraudulent transactions, it suggests fraud often involves larger sums of money. And peaks (wider sections) indicate common transaction amounts. Other side, narrow tails in the non-fraudulent category might indicate a consistent range of typical transaction amounts, while broader tails for fraud might highlight its unpredictability.
# Violin plot for Balance Changes (oldbalanceOrg - newbalanceOrig) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = oldbalanceOrg - newbalanceOrig, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
labs(title = "Origin Balance Change Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Balance Change (oldbalanceOrg - newbalanceOrig)") +
theme_minimal() +
theme(legend.position = "none")
Non-Fraudulent Transactions (FALSE)
The distribution is very narrow, with most balance changes concentrated near small values (close to 0). This indicates that for non-fraudulent transactions, the sender’s balance changes are typically modest and predictable. The balance change pattern is consistent with legitimate transactions, where the deducted amount aligns with normal activities.
Fraudulent Transactions (TRUE)
The distribution is much broader, and balance changes can be extremely large, reaching values in the millions. This suggests that fraudulent transactions often involve significant deductions from the sender’s account. The sharp contrast in balance change magnitude highlights that fraud frequently involves outliers (large, irregular transactions).
# Violin plot for Balance Changes (newbalanceDest - oldbalanceDest) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = newbalanceDest - oldbalanceDest, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
labs(title = "Destination Balance Change Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Balance Change (newbalanceDest - oldbalanceDest)") +
theme_minimal() +
theme(legend.position = "none")
Non-Fraudulent Transactions (FALSE)
The distribution is very narrow, with balance changes clustered near zero. This indicates that in most legitimate transactions, recipients’ balances either increase modestly or stay consistent with expected transaction behavior. The low variability suggests predictable and consistent crediting in legitimate transactions.
Fraudulent Transactions (TRUE)
The distribution for fraudulent transactions also exhibits a narrow range, but there are significant outliers. These outliers indicate: Large negative balance changes, where credits do not align with the deducted amounts from the sender. Inconsistent or unusual behavior, possibly due to incomplete or simulated transactions.
# Split data sets
data_rf <- data
data_rf$isFraud <- as.factor(data_rf$isFraud)
levels(data_rf$isFraud) <- c("NonFraud", "Fraud")
set.seed(123)
train_index <- createDataPartition(data_rf$isFraud, p = 0.7, list = FALSE)
train_data <- data_rf[train_index, ]
test_data <- data_rf[-train_index, ]
#Building model
n_features <- ncol(data_rf) - 1
# Setting up cross-validation controls for random searches
train_control_random <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Random search
set.seed(123)
rf_random_search <- train(
isFraud ~ .,
data = train_data,
method = "rf",
metric = "ROC",
tuneLength = 10,
trControl = train_control_random,
ntree = 100
)
print(rf_random_search)
## Random Forest
##
## 267937 samples
## 20 predictor
## 2 classes: 'NonFraud', 'Fraud'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9999890 0.9999925 0.9417722
## 4 0.9999973 0.9999888 0.9544304
## 6 0.9999956 0.9999888 0.9518987
## 8 0.9999950 0.9999850 0.9518987
## 10 0.9999939 0.9999888 0.9518987
## 13 0.9999717 0.9999888 0.9518987
## 15 0.9949039 0.9999888 0.9518987
## 17 0.9949044 0.9999888 0.9518987
## 19 0.9873092 0.9999888 0.9518987
## 22 0.9885763 0.9999888 0.9518987
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
best_mtry <- rf_random_search$bestTune$mtry
cat("Best mtry from Random Search:", best_mtry, "\n")
## Best mtry from Random Search: 4
Due to the large amount of data, we first used random search to narrow down the tuning parameters. A grid search was then used to determine the exact ground parameters. This reduces runtime and improves efficiency.
# Grid Search
# Narrow down mtry based on random search results
rf_grid <- expand.grid(
mtry = seq(max(1, best_mtry - 2), best_mtry + 2, by = 1) #Make sure mtry is not less than 1
)
# Setting up cross-validation controls for grid searches
train_control_grid <- trainControl(
method = "cv", # cross-validation
number = 5, # 5-fold
classProbs = TRUE,
summaryFunction = twoClassSummary # Use of ROC as an assessment indicator
)
# Grid Search
set.seed(123)
rf_grid_search <- train(
isFraud ~ .,
data = train_data,
method = "rf",
metric = "ROC",
tuneGrid = rf_grid,
trControl = train_control_grid,
ntree = 100
)
print(rf_grid_search)
## Random Forest
##
## 267937 samples
## 20 predictor
## 2 classes: 'NonFraud', 'Fraud'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9999898 0.9999925 0.9518987
## 3 0.9999943 0.9999888 0.9518987
## 4 0.9999965 0.9999888 0.9518987
## 5 0.9999958 0.9999888 0.9544304
## 6 0.9999974 0.9999850 0.9569620
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
cat("Best mtry from Grid Search:", rf_grid_search$bestTune$mtry, "\n")
## Best mtry from Grid Search: 6
# Model Evaluation
# Evaluate the final model on the test set
final_model <- rf_grid_search$finalModel
test_predictions <- predict(rf_grid_search, newdata = test_data, type = "prob")[, 2]
# Calculate AUC
test_roc <- roc(test_data$isFraud, test_predictions)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
cat("Test AUC:", auc(test_roc), "\n")
## Test AUC: 0.9970317
# Plotting ROC curves
plot(test_roc, main = "ROC Curve for Final Model")
As can be seen from the provided ROC curve graphs, the model’s classification performance is very strong. The curve is close to the upper left corner, indicating that the model performs well in distinguishing between positive (fraudulent) and negative (non-fraudulent) samples with high Sensitivity and Specificity. In addition, the area under the ROC curve (AUC) is expected to be close to 1, which further validates that the model’s classification ability is almost optimal under different thresholds. However, this near-perfect performance may require vigilance against the possibility of overfitting, especially if the test data is too similar to the training data distribution or there is a data leakage problem. A combination of confusion matrices, classification metrics (e.g., precision and recall), and further examination of the data distribution is needed to ensure the robustness and generalisation ability of the model.
# Confusion Matrix
test_predictions_class <- predict(rf_grid_search, newdata = test_data, type = "raw")
confusion_matrix <- confusionMatrix(test_predictions_class, test_data$isFraud)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NonFraud Fraud
## NonFraud 114659 2
## Fraud 1 167
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9985
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9911
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9882
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9940
## Prevalence : 0.9985
## Detection Rate : 0.9985
## Detection Prevalence : 0.9985
## Balanced Accuracy : 0.9941
##
## 'Positive' Class : NonFraud
##
From the confusion matrix and statistical metrics, the Random Forest model performed extremely well in the classification task. The confusion matrix shows that the model had very few misclassifications in the test set, with 2 predictions in the NonFraud category being misclassified as Fraud and only 1 prediction in the Fraud category being misclassified as NonFraud. the overall Accuracy of the model is 1, with an almost perfect 95% confidence interval (0.9999, 1). The Sensitivity of 1 indicates that the model correctly detects all positive samples (NonFraud), while the Specificity of 0.9882 indicates that the model is also very accurate in classifying negative samples (Fraud). The Positive Pred Value (Pos Pred Value) of 1 indicates that all samples predicted as NonFraud are true NonFraud, while the Negative Pred Value (Neg Pred Value) of 0.994 indicates that most of the samples predicted as Fraud are also accurate.
data_logit <- data
data_logit$isFraud <- as.factor(data_logit$isFraud)
levels(data_logit$isFraud) <- c("NonFraud", "Fraud")
set.seed(123)
train_index <- createDataPartition(data_logit$isFraud, p = 0.7, list = FALSE)
train_data <- data_logit[train_index, ]
test_data <- data_logit[-train_index, ]
logit_model <- glm(isFraud ~ ., data = train_data, family = binomial, control = list(maxit = 50))
## Warning: glm.fit:拟合概率算出来是数值零或一
summary(logit_model)
##
## Call:
## glm(formula = isFraud ~ ., family = binomial, data = train_data,
## control = list(maxit = 50))
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -3042866434641853837006482.000 46921597419489696.000
## step 4110946821902.652 84405.344
## amount -1732453559.117 3.742
## nameOrig 12567222501.885 448.652
## oldbalanceOrg 3305655614.222 3.872
## newbalanceOrig -6655561949.567 3.954
## nameDest -5066725870.390 449.323
## oldbalanceDest -1631068099.140 2.150
## newbalanceDest 1453817041.490 2.149
## type_CASH_IN NA NA
## type_CASH_OUT 3042866434306248313600224.000 46921597419361032.000
## type_DEBIT 3042866433682500413866662.000 46921597419363176.000
## type_PAYMENT 3042866434390436115200408.000 46921597419363704.000
## type_TRANSFER 3042866434561473524068842.000 46921597419325040.000
## day -64363358160846.922 2066016.713
## periodEvening 16153548717839.422 546064.555
## periodMorning 46468472513322.094 533547.603
## periodNight 134108480308588.109 1425457.084
## ratio_orig -4340634001.569 177.209
## change_orig NA NA
## ratio_dest -111963329231.272 774.756
## change_dest NA NA
## change_diff NA NA
## z value Pr(>|z|)
## (Intercept) -64850018 <0.0000000000000002 ***
## step 48704817 <0.0000000000000002 ***
## amount -462967637 <0.0000000000000002 ***
## nameOrig 28011071 <0.0000000000000002 ***
## oldbalanceOrg 853748967 <0.0000000000000002 ***
## newbalanceOrig -1683415101 <0.0000000000000002 ***
## nameDest -11276365 <0.0000000000000002 ***
## oldbalanceDest -758741709 <0.0000000000000002 ***
## newbalanceDest 676479802 <0.0000000000000002 ***
## type_CASH_IN NA NA
## type_CASH_OUT 64850018 <0.0000000000000002 ***
## type_DEBIT 64850018 <0.0000000000000002 ***
## type_PAYMENT 64850018 <0.0000000000000002 ***
## type_TRANSFER 64850018 <0.0000000000000002 ***
## day -31153358 <0.0000000000000002 ***
## periodEvening 29581757 <0.0000000000000002 ***
## periodMorning 87093396 <0.0000000000000002 ***
## periodNight 94081037 <0.0000000000000002 ***
## ratio_orig -24494500 <0.0000000000000002 ***
## change_orig NA NA
## ratio_dest -144514367 <0.0000000000000002 ***
## change_dest NA NA
## change_diff NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5939.9 on 267936 degrees of freedom
## Residual deviance: 6487.9 on 267918 degrees of freedom
## AIC: 6525.9
##
## Number of Fisher Scoring iterations: 27
test_data$predicted_prob <- predict(logit_model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, "Fraud", "NonFraud")
test_data$predicted_class <- factor(test_data$predicted_class, levels = levels(data_logit$isFraud))
conf_matrix <- confusionMatrix(
test_data$predicted_class,
test_data$isFraud
)
conf_matrix_table <- as.table(conf_matrix)
conf_matrix_df <- as.data.frame(conf_matrix_table)
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
misclassification_rate <- 1 - accuracy
specificity <- conf_matrix$byClass['Specificity']
metrics_df <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Misclassification Rate", "Specificity"),
Value = c(accuracy, precision, recall, f1_score, misclassification_rate, specificity)
)
print(metrics_df)
## Metric Value
## 1 Accuracy 0.9997561592
## 2 Precision 0.9998866193
## 3 Recall 0.9998691784
## 4 F1 Score 0.9998778988
## 5 Misclassification Rate 0.0002438408
## 6 Specificity 0.9230769231
mcnemar_test <- mcnemar.test(conf_matrix$table)
cat("McNemar's Test P-Value:", mcnemar_test$p.value, "\n")
## McNemar's Test P-Value: 0.8501067
roc_curve <- roc(test_data$isFraud, test_data$predicted_prob)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.9614731
plot(roc_curve, main = "ROC Curve", col = "blue")
feature_importance <- summary(logit_model)$coefficients
importance_df <- data.frame(
Feature = rownames(feature_importance),
Coefficient = feature_importance[, 1],
PValue = feature_importance[, 4]
)
importance_df <- importance_df[!grepl("Intercept", importance_df$Feature), ]
importance_df <- na.omit(importance_df)
feature_summary_df <- importance_df %>%
mutate(Var1 = ifelse(Coefficient > 0, "Positive", "Negative")) %>%
count(Var1) %>%
rename(Freq = n) %>%
mutate(Percentage = Freq / sum(Freq) * 100)
ggplot(feature_summary_df, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar(theta = "y") +
labs(title = "Distribution of Feature Coefficients (Positive vs Negative)", x = NULL, y = NULL) +
theme_void() +
scale_fill_manual(values = c("skyblue", "lightcoral")) +
geom_text(aes(label = paste(Var1, "\n", round(Percentage, 1), "%")), position = position_stack(vjust = 0.5), size = 5, color = "white") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(importance_df, aes(x = reorder(Feature, Coefficient), y = Coefficient, fill = Coefficient > 0)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("skyblue", "lightcoral")) +
labs(title = "Feature Importance (Coefficients) from Logistic Regression",
x = "Features",
y = "Coefficient Value") +
theme_minimal() +
theme(axis.text.y = element_text(size = 10),
plot.title = element_text(hjust = 0.5))
test_data$risk_score <- test_data$predicted_prob
test_data$risk_category <- ifelse(test_data$predicted_prob > 0.7, "High Risk",
ifelse(test_data$predicted_prob > 0.3, "Medium Risk", "Low Risk"))
head(test_data[, c("predicted_prob", "risk_score", "risk_category")])
## predicted_prob risk_score risk_category
## 1 0.0000000000000002220446 0.0000000000000002220446 Low Risk
## 2 0.0000000000000002220446 0.0000000000000002220446 Low Risk
## 5 0.0000000000000002220446 0.0000000000000002220446 Low Risk
## 16 0.0000000000000002220446 0.0000000000000002220446 Low Risk
## 19 0.0000000000000002220446 0.0000000000000002220446 Low Risk
## 30 0.0000000000000002220446 0.0000000000000002220446 Low Risk
ggplot(test_data, aes(x = risk_category)) +
geom_bar(fill = "skyblue") +
labs(title = "Distribution of Risk Categories", x = "Risk Category", y = "Count") +
theme_minimal()
ggplot(conf_matrix_df, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "blue", name = "Frequency") +
geom_text(aes(label = Freq), color = "white", size = 5) +
labs(
title = "Confusion Matrix Visualization",
x = "Predicted Class",
y = "Actual Class"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
The model demonstrated exceptional performance, achieving an overall accuracy of 1 with a sensitivity of 1.0 and specificity of 0.9882, as reflected in the confusion matrix. These results indicate that the model correctly identified almost all non-fraudulent and fraudulent transactions, with only three misclassifications in total. The positive predictive value (precision) of 1.0 for the non-fraudulent class further demonstrates the model’s reliability in avoiding false positives, while a Kappa score of 0.9911 confirms a near-perfect agreement between predictions and actual labels. These metrics strongly suggest that the Random Forest model not only meets but exceeds the objective, making it a highly effective tool for fraud detection.
Despite its strong performance, several considerations need to be addressed to validate the model’s robustness. Fraudulent transactions are typically rare, raising concerns about potential overfitting or reliance on majority class patterns, even with strong metrics. Additionally, the almost perfect results may indicate a potential similarity between the training and testing datasets, or even a risk of data leakage, which should be ruled out by further testing on external datasets. To ensure broader applicability, the model should also be tested on more imbalanced and diverse datasets to assess its generalization capability. Beyond its predictive accuracy, the Random Forest model provides interpretability through feature importance analysis, offering actionable insights into the key drivers of fraud. These insights can be invaluable for organizations to strengthen fraud prevention strategies and optimize their operations.
In regression model, accuracy stands at an impressive 99.98%, demonstrating that the model is highly accurate and correctly predicts outcomes for nearly all cases. This near-perfect accuracy suggests robust performance across various scenarios. Besides, precision reaches an extraordinary 99.99%, indicating exceptional reliability when the model predicts a positive class, such as in fraud detection. This means that when the model identifies a positive instance, it is almost always correct, minimizing false positive errors. More than that, the recall metric is equally remarkable at 99.99%, showcasing the model’s effectiveness in identifying almost all actual positive cases. This high recall ensures that very few genuine positive instances are missed, which is crucial in sensitive applications like fraud detection or medical diagnostics. And the F1 score of 99.99% further validates the model’s outstanding performance by balancing precision and recall. This metric confirms that the model maintains exceptional performance across both identifying positive cases and maintaining high accuracy. Therefore, the misclassification rate is remarkably low at just 0.02%, signifying that the model makes very few errors in its predictions. This minimal error rate underscores the model’s reliability and precision in classification tasks. While most metrics are near-perfect, the specificity is slightly lower at 92.31%. This indicates that the model is somewhat less effective in correctly identifying negative cases, such as non-fraudulent events. Despite this relatively lower specificity, the overall performance remains exceptionally strong.
The Random Forest model has successfully addressed the objective of accurately classifying fraudulent and non-fraudulent transactions. Its high accuracy, precision, and recall demonstrate its potential as a reliable fraud detection tool. However, further validation on unseen datasets and in real-world scenarios is recommended to ensure its robustness and long-term effectiveness. In the regression model, the feature importantce shows the transfer, payment, cash out and debt contribute significantly to the likelihood of fraud.