Detecting Online Payment Fraud Using Machine Learning Models
Lecturer: Dr. Ang Group 9
| Matric | Full Name |
|---|---|
| 23121328 | Mohammed Iqram |
| 24052516 | LI JUNMING |
| 22106713 | LI YUEXIN |
| 23111676 | LIU YICONG |
| 23108677 | ZHAO ZITONG |
Fraud detection has become highly important in today’s world, when most of us depend on digital payments and online transactions. Many people buy and sell many things in the virtual world, which increases the chances of fraud. Losses in the virtual world due to fraud now reach billions of dollars each year. This affects the bottom line of businesses and reduces customer confidence and trust. Hence, companies work hard and Pursue all possibilities to prevent fraud so that all types of service users remain safe.
There are significantly more genuine transactions than fraudulent ones, resulting in an imbalance that makes it difficult to identify the fraudulent transactions. This is one of the main issues with fraud detection. Additionally, fraudsters are always changing strategies, so systems must get smarter in order to recognize them. Therefore, it will be crucial to identify the patterns or indications that point to fraud and utilize machine learning to develop predictive systems that can aid in its detection and help in prevention.
This study aims to identify the underlying patterns of fraudulent activities and build a machine learning model that is capable of accurately classifying transactions as fraudulent or non-fraudulent.
Question 1: How can machine learning models be optimized to achieve the highest accuracy in classifying financial transactions as fraudulent or non-fraudulent?
Objective 1: To accurately classify a transaction as fraudulent or non-fraudulent using machine learning models.
Proposed Model: Random Forest & XgBoost
Question 2: What are the key factors influencing the amount of fraudulent transactions?
Objective 2: To determine the key factors influencing the amount of fraudulent transactions.
Proposed Model: Lasso Regression, Linear Regression, and Polynomial Regression
rawdata <- read.csv('dirty_data.csv') # here we are importing the dataset
head(rawdata) # here checking first six rows of rawdata
## step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 1 0
## 4 21182 0 1 0
## 5 0 0 0 0
## 6 0 0 0 0
Year: 2022
Purpose of Dataset: Online payment fraud detection by
machine learning
Dimension:
- Number of rows: 1,048,575
- Number of columns: 11
Data Size: 493.53 MB
Dataset Link: Online Payment Fraud Detection on Kaggle
| Column Name | Description |
|---|---|
| step | Represents a unit of time where 1 step equals 1 hour |
| type | Type of online transaction |
| amount | The amount of the transaction |
| nameOrig | Customer starting the transaction |
| oldbalanceOrg | Balance before the transaction |
| newbalanceOrig | Balance after the transaction |
| nameDest | Recipient of the transaction |
| oldbalanceDest | Initial balance of recipient before the transaction |
| newbalanceDest | New balance of recipient after the transaction |
| isFraud | Indicates whether the transaction is fraudulent (1 = Fraud) |
# Checking the dimension of the dataset
dim(rawdata)
## [1] 1048575 11
# knowing the structure of the dataset
str(rawdata)
## 'data.frame': 1048575 obs. of 11 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : chr "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
# knowing the summary of the dataset
summary(rawdata)
## step type amount nameOrig
## Min. : 1.00 Length:1048575 Min. : 0 Length:1048575
## 1st Qu.:15.00 Class :character 1st Qu.: 12149 Class :character
## Median :20.00 Mode :character Median : 76343 Mode :character
## Mean :26.97 Mean : 158667
## 3rd Qu.:39.00 3rd Qu.: 213762
## Max. :95.00 Max. :10000000
## NA's :6
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. : 0 Min. : 0 Length:1048575 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 Class :character 1st Qu.: 0
## Median : 16002 Median : 0 Mode :character Median : 126377
## Mean : 874010 Mean : 893809 Mean : 978160
## 3rd Qu.: 136642 3rd Qu.: 174600 3rd Qu.: 915923
## Max. :38900000 Max. :38900000 Max. :42100000
##
## newbalanceDest isFraud isFlaggedFraud
## Min. : 0 Min. :0.000000 Min. :0
## 1st Qu.: 0 1st Qu.:0.000000 1st Qu.:0
## Median : 218260 Median :0.000000 Median :0
## Mean : 1114198 Mean :0.001089 Mean :0
## 3rd Qu.: 1149808 3rd Qu.:0.000000 3rd Qu.:0
## Max. :42200000 Max. :1.000000 Max. :0
##
data <- rawdata # for data cleaning purpose
data <- rawdata %>% select(-"isFlaggedFraud") # df <- df %>% select(-B)
head(data)
## step type amount nameOrig oldbalanceOrg newbalanceOrig nameDest
## 1 1 PAYMENT 9839.64 C1231006815 170136 160296.36 M1979787155
## 2 1 PAYMENT 1864.28 C1666544295 21249 19384.72 M2044282225
## 3 1 TRANSFER 181.00 C1305486145 181 0.00 C553264065
## 4 1 CASH_OUT 181.00 C840083671 181 0.00 C38997010
## 5 1 PAYMENT 11668.14 C2048537720 41554 29885.86 M1230701703
## 6 1 PAYMENT 7817.71 C90045638 53860 46042.29 M573487274
## oldbalanceDest newbalanceDest isFraud
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 21182 0 1
## 5 0 0 0
## 6 0 0 0
colSums(is.na(data))
## step type amount nameOrig oldbalanceOrg
## 0 0 6 0 0
## newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud
## 0 0 0 0 0
It shows that the Amount Column has 6 Null values
data$amount[is.na(data$amount)] <-
data$oldbalanceOrg[is.na(data$amount)] - data$newbalanceOrig[is.na(data$amount)]
To fill missing values in the amount column, use the logical relationship:
amount=oldbalanceOrg−newbalanceOrig
If amount is missing, it can be computed by subtracting newbalanceOrig from oldbalanceOrg to maintain consistency in the data.
sum(is.na(data))
## [1] 0
There are no blank values
sum(duplicated(data))
## [1] 0
No duplicates found
str(data)
## 'data.frame': 1048575 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : chr "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
data$type <- as.factor(data$type)
data$isFraud <- as.logical(data$isFraud)
data$amount <- as.numeric(data$amount)
data$oldbalanceOrg <- as.numeric(data$oldbalanceOrg)
data$newbalanceOrig <- as.numeric(data$newbalanceOrig)
data$oldbalanceDest <- as.numeric(data$oldbalanceDest)
data$newbalanceDest <- as.numeric(data$newbalanceDest)
summary(data)
## step type amount nameOrig
## Min. : 1.00 CASH_IN :227130 Min. : -227335 Length:1048575
## 1st Qu.:15.00 CASH_OUT:373641 1st Qu.: 12149 Class :character
## Median :20.00 DEBIT : 7178 Median : 76342 Mode :character
## Mean :26.97 PAYMENT :353873 Mean : 158666
## 3rd Qu.:39.00 TRANSFER: 86753 3rd Qu.: 213762
## Max. :95.00 Max. :10000000
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Min. : 0 Min. : 0 Length:1048575 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 Class :character 1st Qu.: 0
## Median : 16002 Median : 0 Mode :character Median : 126377
## Mean : 874010 Mean : 893809 Mean : 978160
## 3rd Qu.: 136642 3rd Qu.: 174600 3rd Qu.: 915923
## Max. :38900000 Max. :38900000 Max. :42100000
## newbalanceDest isFraud
## Min. : 0 Mode :logical
## 1st Qu.: 0 FALSE:1047433
## Median : 218260 TRUE :1142
## Mean : 1114198
## 3rd Qu.: 1149808
## Max. :42200000
boxplot(data$amount, main = "Boxplot for Amount", ylab = "Amount")
boxplot(data$oldbalanceOrg, main = "Boxplot for Old Balance Origin", ylab = "Old Balance Origin")
boxplot(data$newbalanceOrig, main = "Boxplot for New Balance Origin", ylab = "New Balance Origin")
boxplot(data$oldbalanceDest, main = "Boxplot for Old Balance Des", ylab = "Old Balance Des")
boxplot(data$newbalanceDest, main = "Boxplot for New Balance Des", ylab = "New Balance Des")
From the given boxplots, some variables like amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest still have outliers. I haven’t removed them because those are unusual but valid transactions. In financial data, especially fraud detection, outliers are quite common and can be for potentially fraudulent or high-value transactions. Therefore, retaining these outliers will result in better model accuracy and insight.
# Identify inconsistent rows
inconsistent_rows <- data[data$newbalanceOrig > data$oldbalanceOrg |
data$newbalanceDest > data$oldbalanceDest, ]
cat("Inconsistent rows:", nrow(inconsistent_rows), "\n")
## Inconsistent rows: 661442
# Keep only consistent rows
data <- data[data$newbalanceOrig <= data$oldbalanceOrg &
data$newbalanceDest <= data$oldbalanceDest, ]
cat(all(data$newbalanceOrig <= data$oldbalanceOrg), "\n")
## TRUE
cat(all(data$newbalanceDest <= data$oldbalanceDest), "\n")
## TRUE
str(data)
## 'data.frame': 387133 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : logi FALSE FALSE TRUE TRUE FALSE FALSE ...
# Keep only consistent rows:38,7133
unique(data$type)
## [1] PAYMENT TRANSFER CASH_OUT DEBIT CASH_IN
## Levels: CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
# unique(data$nameOrig)
# unique(data$nameDest)
Drop rows with type:“CASH_IN”, which is unexpected type to detect fraud
data <- data[data$type != "CASH_IN", ]
cat("Frequency table for 'type':\n")
## Frequency table for 'type':
print(table(data$type))
##
## CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
## 0 24731 1036 353873 3126
str(data)
## 'data.frame': 382766 obs. of 10 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : chr "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : chr "M1979787155" "M2044282225" "C553264065" "C38997010" ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : logi FALSE FALSE TRUE TRUE FALSE FALSE ...
after droped “CASH_IN”: 38,7133 observations >> 38,2766 observations
colnames(data)
## [1] "step" "type" "amount" "nameOrig"
## [5] "oldbalanceOrg" "newbalanceOrig" "nameDest" "oldbalanceDest"
## [9] "newbalanceDest" "isFraud"
data <- dummy_cols(
data,
select_columns = c("type"),
remove_selected_columns = TRUE
)
Hashing Encoding for categorical variables (nameOrig, nameDest) to convert them into numeric format.
data$nameOrig <- as.integer(sapply(data$nameOrig, function(x) {
hash_value <- digest(x, algo = "xxhash32", seed = 123)
as.numeric(paste0("0x", hash_value)) %% 1000 #num_buckets could change based on model
}))
data$nameDest <- as.integer(sapply(data$nameDest, function(x) {
hash_value <- digest(x, algo = "xxhash32", seed = 123)
as.numeric(paste0("0x", hash_value)) %% 1000
}))
# Convert steps into days and create bins for different time periods
data <- data %>%
mutate(day = step %/% 24 + 1 )%>%
mutate(period = case_when(
step %% 24 >= 0 & step %% 24 < 6 ~ "Night",
step %% 24 >= 6 & step %% 24 < 12 ~ "Morning",
step %% 24 >= 12 & step %% 24 < 18 ~ "Afternoon",
TRUE ~ "Evening"
))
# Calculate ratio of transaction amount to the old balance and balance change
# For the originating account
data <- data %>%
mutate(ratio_orig = ifelse(oldbalanceOrg > 0, amount / oldbalanceOrg, 0))%>%
mutate(change_orig = oldbalanceOrg - newbalanceOrig)
# For the destination account
data <- data %>%
mutate(ratio_dest = ifelse(oldbalanceDest > 0, amount / oldbalanceDest, 0)) %>%
mutate(change_dest = oldbalanceDest - newbalanceDest)
# Difference between origin and destination balance change
data <- data %>%
mutate(change_diff = change_orig - change_dest)
we converted steps into days and created time period bins (Night, Morning, Afternoon, Evening) based on the step value. We then calculated the ratio of transaction amount to the old balance and balance change for both originating and destination accounts, followed by calculating the difference in balance changes between the origin and destination accounts.
head(data)
## step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1 1 9839.64 498 170136 160296.36 563 0
## 2 1 1864.28 727 21249 19384.72 613 0
## 3 1 181.00 293 181 0.00 124 0
## 4 1 181.00 977 181 0.00 92 21182
## 5 1 11668.14 202 41554 29885.86 10 0
## 6 1 7817.71 259 53860 46042.29 721 0
## newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1 0 FALSE 0 0 0 1
## 2 0 FALSE 0 0 0 1
## 3 0 TRUE 0 0 0 0
## 4 0 TRUE 0 1 0 0
## 5 0 FALSE 0 0 0 1
## 6 0 FALSE 0 0 0 1
## type_TRANSFER day period ratio_orig change_orig ratio_dest change_dest
## 1 0 1 Night 0.05783397 9839.64 0.000000000 0
## 2 0 1 Night 0.08773495 1864.28 0.000000000 0
## 3 1 1 Night 1.00000000 181.00 0.000000000 0
## 4 0 1 Night 1.00000000 181.00 0.008544991 21182
## 5 0 1 Night 0.28079463 11668.14 0.000000000 0
## 6 0 1 Night 0.14514872 7817.71 0.000000000 0
## change_diff
## 1 9839.64
## 2 1864.28
## 3 181.00
## 4 -21001.00
## 5 11668.14
## 6 7817.71
options(scipen = 999)
plot_data <- data %>%
group_by(isFraud) %>%
summarise(
CASH_OUT = sum(type_CASH_OUT),
DEBIT = sum(type_DEBIT),
PAYMENT = sum(type_PAYMENT),
TRANSFER = sum(type_TRANSFER)
)
plot_data_long <- plot_data %>%
pivot_longer(
cols = c(CASH_OUT, DEBIT, PAYMENT, TRANSFER),
names_to = "Transaction_Type",
values_to = "Value"
)
plot_data_long$isFraud <- as.factor(plot_data_long$isFraud)
ggplot(plot_data_long, aes(x = Transaction_Type, y = Value, fill = isFraud)) +
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8), color = "black") +
geom_text(aes(label = Value),
vjust = -0.5,
size = 3,
position = position_dodge(width = 0.8),
color = "black") +
scale_y_continuous(trans = "identity") +
labs(
title = "Count Plot of Transaction Types by Fraud Status",
x = "Transaction Type",
y = "Number of Transactions",
fill = "Fraud Status"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "right"
) +
scale_fill_manual(
values = c("FALSE" = "skyblue", "TRUE" = "#FF6666"),
labels = c("Non-Fraudulent", "Fraudulent")
)
Insight:
cols_of_interest <- c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")
data_filtered <- data[cols_of_interest]
for (col_name in cols_of_interest) {
log_col_name <- paste0("Log_", col_name)
data_filtered[[log_col_name]] <- log(data_filtered[[col_name]] + 1)
p <- ggplot(data_filtered, aes(x = .data[[log_col_name]])) +
geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of Log-transformed", col_name),
x = paste("Log(", col_name, " + 1)", sep = ""),
y = "Frequency") +
theme_minimal()
print(p)
}
Insights:
Log-Transformed Transaction Amount:
Log-Transformed Old Balance Origin (oldbalanceOrg):
Log-Transformed New Balance Origin (newbalanceOrig):
#correlation matrix using hierarchical clustering
num_cols <- sapply(data, is.numeric)
cor_matrix <- cor(data[, num_cols], use = "complete.obs")
## Warning in cor(data[, num_cols], use = "complete.obs"): the standard deviation
## is zero
cor_matrix <- cor_matrix[rownames(cor_matrix) != "type_CASH_IN", colnames(cor_matrix) != "type_CASH_IN"]
dist_matrix <- as.dist(1 - cor_matrix)
if (any(is.na(dist_matrix))) {
stop("Distance matrix contains NA values.")
}
if (any(dist_matrix < 0, na.rm = TRUE)) {
stop("Distance matrix contains negative values.")
}
hclust_order <- hclust(dist_matrix)$order
cor_matrix <- cor_matrix[hclust_order, hclust_order]
corrplot(
cor_matrix,
method = "circle",
type = "full",
tl.cex = 0.8,
cl.cex = 0.8,
tl.col = "black",
addrect = 2
)
Insights
Strong correlations:
head(data)
## step amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1 1 9839.64 498 170136 160296.36 563 0
## 2 1 1864.28 727 21249 19384.72 613 0
## 3 1 181.00 293 181 0.00 124 0
## 4 1 181.00 977 181 0.00 92 21182
## 5 1 11668.14 202 41554 29885.86 10 0
## 6 1 7817.71 259 53860 46042.29 721 0
## newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1 0 FALSE 0 0 0 1
## 2 0 FALSE 0 0 0 1
## 3 0 TRUE 0 0 0 0
## 4 0 TRUE 0 1 0 0
## 5 0 FALSE 0 0 0 1
## 6 0 FALSE 0 0 0 1
## type_TRANSFER day period ratio_orig change_orig ratio_dest change_dest
## 1 0 1 Night 0.05783397 9839.64 0.000000000 0
## 2 0 1 Night 0.08773495 1864.28 0.000000000 0
## 3 1 1 Night 1.00000000 181.00 0.000000000 0
## 4 0 1 Night 1.00000000 181.00 0.008544991 21182
## 5 0 1 Night 0.28079463 11668.14 0.000000000 0
## 6 0 1 Night 0.14514872 7817.71 0.000000000 0
## change_diff
## 1 9839.64
## 2 1864.28
## 3 181.00
## 4 -21001.00
## 5 11668.14
## 6 7817.71
Investigate consistency of balances. First scatter plot help to understand the relationship between the sender’s original and new balances after transactions, also helps detect unusual or unexpected patterns, like abrupt zero balances or discrepancies. Second scatter plot aims to study how transactions impact the recipient’s account balance. And helps spot anomalies where balances remain unchanged (potential red flags for fraud or simulation.
# Scatter plot for oldbalanceOrg vs newbalanceOrig
plotorg <- ggplot(data, aes(x = oldbalanceOrg, y = newbalanceOrig)) +
geom_point(alpha = 0.5, color = "blue") +
labs(title = "Scatter Plot: oldbalanceOrg vs newbalanceOrig",
x = "Old Balance Origin",
y = "New Balance Origin") +
theme_minimal()
print(plotorg)
# Scatter plot for oldbalanceDest vs newbalanceDest
plotdest <- ggplot(data, aes(x = oldbalanceDest, y = newbalanceDest)) +
geom_point(alpha = 0.5, color = "lightgreen") +
labs(title = "Scatter Plot: oldbalanceDest vs newbalanceDest",
x = "Initial balance of recipient before the transaction",
y = "New balance of recipient after the transaction") +
theme_minimal()
# Print the second plot
print(plotdest)
To investigate fraud patterns by comparing amounts and balance changes for fraudulent and non-fraudulent transactions.
# Violin plot for Transaction Amount by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = amount, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
scale_y_log10() + # Use log scale for better visualization if amounts vary widely
labs(title = "Transaction Amount Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Transaction Amount (log scale)") +
theme_minimal() +
theme(legend.position = "none")
The distribution of transaction amounts for both fraudulent and non-fraudulent transactions. The shape of the violin indicates the density (frequency) of transactions at various amount levels. Besides, log scale is used to better visualize the wide range of transaction amounts. If the fraudulent distribution is skewed towards higher amounts compared to non-fraudulent transactions, it suggests fraud often involves larger sums of money. And peaks (wider sections) indicate common transaction amounts. Other side, narrow tails in the non-fraudulent category might indicate a consistent range of typical transaction amounts, while broader tails for fraud might highlight its unpredictability.
# Violin plot for Balance Changes (oldbalanceOrg - newbalanceOrig) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = oldbalanceOrg - newbalanceOrig, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
labs(title = "Origin Balance Change Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Balance Change (oldbalanceOrg - newbalanceOrig)") +
theme_minimal() +
theme(legend.position = "none")
Non-Fraudulent Transactions (FALSE)
The distribution is very narrow, with most balance changes concentrated near small values (close to 0). This indicates that for non-fraudulent transactions, the sender’s balance changes are typically modest and predictable. The balance change pattern is consistent with legitimate transactions, where the deducted amount aligns with normal activities.
Fraudulent Transactions (TRUE)
The distribution is much broader, and balance changes can be extremely large, reaching values in the millions. This suggests that fraudulent transactions often involve significant deductions from the sender’s account. The sharp contrast in balance change magnitude highlights that fraud frequently involves outliers (large, irregular transactions).
# Violin plot for Balance Changes (newbalanceDest - oldbalanceDest) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = newbalanceDest - oldbalanceDest, fill = as.factor(isFraud))) +
geom_violin(trim = FALSE) +
labs(title = "Destination Balance Change Distribution by Fraud Status",
x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
y = "Balance Change (newbalanceDest - oldbalanceDest)") +
theme_minimal() +
theme(legend.position = "none")
Non-Fraudulent Transactions (FALSE)
The distribution is very narrow, with balance changes clustered near zero. This indicates that in most legitimate transactions, recipients’ balances either increase modestly or stay consistent with expected transaction behavior. The low variability suggests predictable and consistent crediting in legitimate transactions.
Fraudulent Transactions (TRUE)
The distribution for fraudulent transactions also exhibits a narrow range, but there are significant outliers. These outliers indicate: Large negative balance changes, where credits do not align with the deducted amounts from the sender. Inconsistent or unusual behavior, possibly due to incomplete or simulated transactions.
# Split data sets
data_rf <- data
data_rf$isFraud <- as.factor(data_rf$isFraud)
levels(data_rf$isFraud) <- c("NonFraud", "Fraud")
set.seed(123)
train_index <- createDataPartition(data_rf$isFraud, p = 0.7, list = FALSE)
train_data <- data_rf[train_index, ]
test_data <- data_rf[-train_index, ]
# Train the random forest model
set.seed(123)
rf_model <- randomForest(isFraud ~ ., data = train_data, ntree = 100, importance = TRUE)
rf_predictions <- predict(rf_model, newdata = test_data, type = "class")
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$isFraud)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NonFraud Fraud
## NonFraud 114659 3
## Fraud 1 166
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9985
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9881
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 1.0000
## Specificity : 0.9822
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9940
## Prevalence : 0.9985
## Detection Rate : 0.9985
## Detection Prevalence : 0.9985
## Balanced Accuracy : 0.9911
##
## 'Positive' Class : NonFraud
##
# Convert training and test sets to matrix format
X_train <- model.matrix(isFraud ~ . - 1, data = train_data)
X_test <- model.matrix(isFraud ~ . - 1, data = test_data)
class(X_train) # should be "matrix"
## [1] "matrix" "array"
y_train <- as.numeric(train_data$isFraud) - 1 # Converting factors to 0 and 1
y_test <- as.numeric(test_data$isFraud) - 1
# Setting parameters
params <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = 6,
eta = 0.3
)
# Training the XGBoost model
set.seed(123)
xgb_model <- xgboost(
data = X_train,
label = y_train,
params = params,
nrounds = 100,
verbose = 0
)
xgb_predictions <- predict(xgb_model, newdata = X_test)
xgb_pred_labels <- ifelse(xgb_predictions > 0.5, "Fraud", "NonFraud")
# Evaluate XGBoost models
xgb_conf_matrix <- confusionMatrix(factor(xgb_pred_labels, levels = c("NonFraud", "Fraud")), test_data$isFraud)
print(xgb_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NonFraud Fraud
## NonFraud 114660 4
## Fraud 0 165
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9985
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.988
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9763
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.9985
## Detection Rate : 0.9985
## Detection Prevalence : 0.9986
## Balanced Accuracy : 0.9882
##
## 'Positive' Class : NonFraud
##
rf_accuracy <- rf_conf_matrix$overall["Accuracy"]
rf_sensitivity <- rf_conf_matrix$byClass["Sensitivity"]
rf_specificity <- rf_conf_matrix$byClass["Specificity"]
xgb_accuracy <- xgb_conf_matrix$overall["Accuracy"]
xgb_sensitivity <- xgb_conf_matrix$byClass["Sensitivity"]
xgb_specificity <- xgb_conf_matrix$byClass["Specificity"]
# Creating a Performance Comparison Table
performance_comparison <- data.frame(
Model = c("Random Forest", "XGBoost"),
Accuracy = c(rf_accuracy, xgb_accuracy),
Sensitivity = c(rf_sensitivity, xgb_sensitivity),
Specificity = c(rf_specificity, xgb_specificity)
)
print(performance_comparison)
## Model Accuracy Sensitivity Specificity
## 1 Random Forest 0.9999652 0.9999913 0.9822485
## 2 XGBoost 0.9999652 1.0000000 0.9763314
performance_long <- melt(performance_comparison, id.vars = "Model", variable.name = "Metric", value.name = "Value")
# Plotting Performance Comparison Bar Graphs
ggplot(performance_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
labs(title = "Model Performance Comparison", x = "Metric", y = "Value") +
theme_minimal()
# Feature Importance
rf_importance <- importance(rf_model)
rf_importance <- data.frame(Feature = rownames(rf_importance), Importance = rf_importance[, 1])
rf_importance <- rf_importance[order(-rf_importance$Importance), ]
library(ggplot2)
ggplot(rf_importance, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
coord_flip() +
labs(title = "Feature Importance from Random Forest", x = "Features", y = "Importance") +
theme_minimal()
#Building model
n_features <- ncol(data_rf) - 1
# Setting up cross-validation controls for random searches
train_control_random <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Random search
set.seed(123)
rf_random_search <- train(
isFraud ~ .,
data = train_data,
method = "rf",
metric = "ROC",
tuneLength = 10,
trControl = train_control_random,
ntree = 100
)
print(rf_random_search)
## Random Forest
##
## 267937 samples
## 20 predictor
## 2 classes: 'NonFraud', 'Fraud'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9999890 0.9999925 0.9417722
## 4 0.9999973 0.9999888 0.9544304
## 6 0.9999956 0.9999888 0.9518987
## 8 0.9999950 0.9999850 0.9518987
## 10 0.9999939 0.9999888 0.9518987
## 13 0.9999717 0.9999888 0.9518987
## 15 0.9949039 0.9999888 0.9518987
## 17 0.9949044 0.9999888 0.9518987
## 19 0.9873092 0.9999888 0.9518987
## 22 0.9885763 0.9999888 0.9518987
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
best_mtry <- rf_random_search$bestTune$mtry
cat("Best mtry from Random Search:", best_mtry, "\n")
## Best mtry from Random Search: 4
# Grid Search
# Narrow down mtry based on random search results
rf_grid <- expand.grid(
mtry = seq(max(1, best_mtry - 2), best_mtry + 2, by = 1) #Make sure mtry is not less than 1
)
# Setting up cross-validation controls for grid searches
train_control_grid <- trainControl(
method = "cv", # cross-validation
number = 5, # 5-fold
classProbs = TRUE,
summaryFunction = twoClassSummary # Use of ROC as an assessment indicator
)
# Grid Search
set.seed(123)
rf_grid_search <- train(
isFraud ~ .,
data = train_data,
method = "rf",
metric = "ROC",
tuneGrid = rf_grid,
trControl = train_control_grid,
ntree = 100
)
print(rf_grid_search)
## Random Forest
##
## 267937 samples
## 20 predictor
## 2 classes: 'NonFraud', 'Fraud'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9999898 0.9999925 0.9518987
## 3 0.9999943 0.9999888 0.9518987
## 4 0.9999965 0.9999888 0.9518987
## 5 0.9999958 0.9999888 0.9544304
## 6 0.9999974 0.9999850 0.9569620
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
cat("Best mtry from Grid Search:", rf_grid_search$bestTune$mtry, "\n")
## Best mtry from Grid Search: 6
# Model Evaluation
# Evaluate the final model on the test set
final_model <- rf_grid_search$finalModel
test_predictions <- predict(rf_grid_search, newdata = test_data, type = "prob")[, 2]
# Calculate AUC
test_roc <- roc(test_data$isFraud, test_predictions)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
cat("Test AUC:", auc(test_roc), "\n")
## Test AUC: 0.9970317
# Plotting ROC curves
plot(test_roc, main = "ROC Curve for Final Model")
As can be seen from the provided ROC curve graphs, the model’s classification performance is very strong. The curve is close to the upper left corner, indicating that the model performs well in distinguishing between positive (fraudulent) and negative (non-fraudulent) samples with high Sensitivity and Specificity. In addition, the area under the ROC curve (AUC) is expected to be close to 1, which further validates that the model’s classification ability is almost optimal under different thresholds. However, this near-perfect performance may require vigilance against the possibility of overfitting, especially if the test data is too similar to the training data distribution or there is a data leakage problem. A combination of confusion matrices, classification metrics (e.g., precision and recall), and further examination of the data distribution is needed to ensure the robustness and generalisation ability of the model.
# Confusion Matrix
test_predictions_class <- predict(rf_grid_search, newdata = test_data, type = "raw")
confusion_matrix <- confusionMatrix(test_predictions_class, test_data$isFraud)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction NonFraud Fraud
## NonFraud 114659 2
## Fraud 1 167
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.9985
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9911
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9882
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9940
## Prevalence : 0.9985
## Detection Rate : 0.9985
## Detection Prevalence : 0.9985
## Balanced Accuracy : 0.9941
##
## 'Positive' Class : NonFraud
##
From the confusion matrix and statistical metrics, the Random Forest model performed extremely well in the classification task. The confusion matrix shows that the model had very few misclassifications in the test set, with only 1 transaction in the NonFraud category being misclassified as Fraud and 2 in the Fraud category being misclassified as NonFraud. The overall Accuracy of the model is 1, with an almost perfect 95% confidence interval (0.9999, 1). The Sensitivity of 1 indicates that the model correctly detects all positive samples (Fraud), while the Specificity of 0.9882 indicates that the model is also very accurate in classifying negative samples (NonFraud). The Positive Pred Value (Pos Pred Value) of 1 indicates that all samples predicted as Fraud are true Fraud, while the Negative Pred Value (Neg Pred Value) of 0.994 indicates that most of the samples predicted as NonFraud are also accurate.
# Subset the data for fraudulent transactions
fraud_data <- subset(data, isFraud == 1)
# Remove rows with missing values in the 'amount' column
fraud_data <- fraud_data[!is.na(fraud_data$amount), ]
# Print the distribution of transaction types
cat("Distribution of transaction types:\n")
## Distribution of transaction types:
print(colSums(fraud_data[, grep("^type_", colnames(fraud_data))]))
## type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT type_TRANSFER
## 0 21 0 0 543
# Remove transaction types with zero samples
fraud_data <- fraud_data[, !colnames(fraud_data) %in% c("type_CASH_IN", "type_DEBIT", "type_PAYMENT")]
# Define features and target variable
features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest",
"type_CASH_OUT", "type_TRANSFER")
y <- fraud_data$amount
# Standardize numerical features
fraud_data[, c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")] <-
scale(fraud_data[, c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")])
# Split the data into training and testing sets
set.seed(42)
train_index <- sample(1:nrow(fraud_data), 0.7 * nrow(fraud_data))
train_data <- fraud_data[train_index, ]
test_data <- fraud_data[-train_index, ]
# Prepare training and testing matrices for glmnet
X_train <- as.matrix(train_data[, features])
y_train <- train_data$amount
X_test <- as.matrix(test_data[, features])
y_test <- test_data$amount
# Perform Lasso regression with cross-validation
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1)
plot(lasso_model)
# Retrieve the best lambda value
best_lambda <- lasso_model$lambda.min
cat("Optimal lambda value:", best_lambda, "\n")
## Optimal lambda value: 28366.47
# Train the final model with the best lambda
final_model <- glmnet(X_train, y_train, alpha = 1, lambda = best_lambda)
# Predict on the testing set
lasso_predictions <- predict(final_model, s = best_lambda, newx = X_test)
# Evaluate the model
lasso_rmse <- sqrt(mean((lasso_predictions - y_test)^2))
lasso_mae <- mean(abs(lasso_predictions - y_test))
lasso_r2 <- 1 - sum((y_test - lasso_predictions)^2) / sum((y_test - mean(y_test))^2)
cat("Model evaluation:\n")
## Model evaluation:
cat("RMSE:", lasso_rmse, "\n")
## RMSE: 58577.56
cat("MAE:", lasso_mae, "\n")
## MAE: 36915.42
cat("R-squared:", lasso_r2, "\n")
## R-squared: 0.998981
# Create a list of linear regression features
linear_features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest",
"type_CASH_OUT", "type_TRANSFER")
# linear regression modeling
linear_model <- lm(amount ~ ., data = train_data[, c(linear_features, "amount")])
# Predicting on the test set
linear_predictions <- predict(linear_model, newdata = test_data[, linear_features])
# Model Evaluation
linear_rmse <- sqrt(mean((linear_predictions - test_data$amount)^2))
linear_r2 <- 1 - sum((test_data$amount - linear_predictions)^2) / sum((test_data$amount - mean(test_data$amount))^2)
cat("Linear Regression Evaluation:\n")
## Linear Regression Evaluation:
cat("RMSE:", linear_rmse, "\n")
## RMSE: 3807.065
cat("R-squared:", linear_r2, "\n")
## R-squared: 0.9999957
# Adding secondary features to training and test sets
train_data$oldbalanceOrg_sq <- train_data$oldbalanceOrg^2
test_data$oldbalanceOrg_sq <- test_data$oldbalanceOrg^2
train_data$newbalanceOrig_sq <- train_data$newbalanceOrig^2
test_data$newbalanceOrig_sq <- test_data$newbalanceOrig^2
# Create a list of polynomial features
poly_features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceOrg_sq", "newbalanceOrig_sq",
"oldbalanceDest", "newbalanceDest", "type_CASH_OUT", "type_TRANSFER")
# polynomial regression modeling
poly_model <- lm(amount ~ ., data = train_data[, c(poly_features, "amount")])
poly_predictions <- predict(poly_model, newdata = test_data[, poly_features])
# Model Evaluation
poly_rmse <- sqrt(mean((poly_predictions - test_data$amount)^2))
poly_r2 <- 1 - sum((test_data$amount - poly_predictions)^2) / sum((test_data$amount - mean(test_data$amount))^2)
cat("Polynomial Regression Evaluation:\n")
## Polynomial Regression Evaluation:
cat("RMSE:", poly_rmse, "\n")
## RMSE: 3903.402
cat("R-squared:", poly_r2, "\n")
## R-squared: 0.9999955
# Summarize results into tables
results <- data.frame(
Model = c("Lasso Regression", "Polynomial Regression", "Linear Regression"),
RMSE = c(lasso_rmse, poly_rmse, linear_rmse),
R_squared = c(lasso_r2, poly_r2, linear_r2)
)
kable(results, caption = "Comparison of Regression Models")
| Model | RMSE | R_squared |
|---|---|---|
| Lasso Regression | 58577.562 | 0.9989810 |
| Polynomial Regression | 3903.402 | 0.9999955 |
| Linear Regression | 3807.065 | 0.9999957 |
# Creating a comparison data frame
comparison <- data.frame(
Actual = test_data$amount,
Polynomial_Predicted = as.vector(poly_predictions),
Linear_Predicted = as.vector(linear_predictions),
Lasso_Predicted = as.vector(lasso_predictions)
)
# actual vs lasso regression predictions
ggplot(comparison, aes(x = Actual, y = Lasso_Predicted)) +
geom_point(alpha = 0.6, color = "yellow") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Lasso_ Regression: Actual vs Predicted",
x = "Actual Values", y = "Predicted Values") +
theme_minimal()
# actual vs. predicted by multinomial regression
ggplot(comparison, aes(x = Actual, y = Polynomial_Predicted)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Polynomial Regression: Actual vs Predicted",
x = "Actual Values", y = "Predicted Values") +
theme_minimal()
# Actual vs. predicted by linear regression
ggplot(comparison, aes(x = Actual, y = Linear_Predicted)) +
geom_point(alpha = 0.6, color = "green") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Linear Regression: Actual vs Predicted",
x = "Actual Values", y = "Predicted Values") +
theme_minimal()
# Feature Importance Analysis
coef_values <- as.vector(coef(final_model, s = best_lambda)[-1]) # Remove intercept
feature_importance <- data.frame(Feature = features, Importance = abs(coef_values))
feature_importance <- feature_importance[order(-feature_importance$Importance), ]
# Plot feature importance
ggplot(feature_importance, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
coord_flip() +
labs(title = "Feature Importance from Lasso Regression Model", x = "Features", y = "Importance") +
theme_minimal()
# Extract coefficients for polynomial regression models
poly_coefficients <- coef(poly_model)
# Creating a Characteristic Importance Data Frame
feature_importance_poly <- data.frame(
Feature = names(poly_coefficients)[-1],
Importance = abs(poly_coefficients[-1])
)
# In order of importance
feature_importance_poly <- feature_importance_poly[order(-feature_importance_poly$Importance), ]
# Characteristic importance of visual polynomial regression
ggplot(feature_importance_poly, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "green", alpha = 0.7) +
coord_flip() +
labs(title = "Feature Importance from Polynomial Regression",
x = "Features", y = "Importance") +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
# Extract the coefficients of the linear regression model
linear_coefficients <- coef(linear_model)
# Creating a Characteristic Importance Data Frame
feature_importance_linear <- data.frame(
Feature = names(linear_coefficients)[-1],
Importance = abs(linear_coefficients[-1])
)
feature_importance_linear <- feature_importance_linear[order(-feature_importance_linear$Importance), ]
## Visualize the importance of the characteristics of linear regression
ggplot(feature_importance_linear, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "orange", alpha = 0.7) +
coord_flip() +
labs(title = "Feature Importance from Linear Regression",
x = "Features", y = "Importance") +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).
# Importance of merger features
combined_feature_importance <- merge(
feature_importance,
feature_importance_poly,
by = "Feature",
all = TRUE,
suffixes = c("_Lasso", "_Poly")
)
combined_feature_importance <- merge(
combined_feature_importance,
feature_importance_linear,
by = "Feature",
all = TRUE,
suffixes = c("", "_Linear")
)
colnames(combined_feature_importance) <- c("Feature", "Importance_Lasso", "Importance_Poly", "Importance_Linear")
combined_feature_importance[is.na(combined_feature_importance)] <- 0
combined_feature_importance <- combined_feature_importance[order(-combined_feature_importance$Importance_Lasso), ]
print(combined_feature_importance)
## Feature Importance_Lasso Importance_Poly Importance_Linear
## 5 oldbalanceOrg 2349177 2424834.34021 2424294.6363237
## 2 newbalanceOrig 634935 704887.83231 710621.3190200
## 1 newbalanceDest 0 13.94108 0.6389738
## 3 newbalanceOrig_sq 0 355.36431 0.0000000
## 4 oldbalanceDest 0 15.94334 0.7334692
## 6 oldbalanceOrg_sq 0 430.86093 0.0000000
## 7 type_CASH_OUT 0 122.87337 12.4795722
## 8 type_TRANSFER 0 0.00000 0.0000000
Evaluation
The Random Forest and XGBoost models were evaluated in an extensive manner using a number of key performance metrics with the aim of ensuring the robustness and generalizability of the models in detecting fraudulent transactions. Both models were optimized using hyperparameter tuning, cross-validation, and then evaluated using the following metrics:
ROC-AUC: This metric assesses the models’ ability to discriminate between fraudulent and non-fraudulent transactions. A higher ROC-AUC score indicates better model performance.
Recall (Sensitivity): Sensitivity ensures that fraudulent transactions are correctly identified, minimizing false negatives.
Specificity: Specificity measures the model’s ability to correctly identify non-fraudulent transactions, reducing false positives.
Accuracy: Accuracy is the overall correctness of the model in classifying both fraudulent and non-fraudulent transactions.
Confusion Matrix: This provides a detailed breakdown of correct and incorrect predictions, offering insight into false positives and false negatives.
Kappa Score: The Kappa score measures the agreement between the model’s predictions and the actual outcomes, accounting for chance.
P-Value: The p-value tests the statistical significance of the model’s performance. Both models were optimized using Random Search and Grid Search for hyperparameter tuning, resulting in high-performance levels. A 5-fold cross-validation approach was used to ensure the models’ robustness and generalizability.
Results
Random Forest Model:
Accuracy: 1.0 (95% CI: 0.9999–1)
Sensitivity (Recall): 1.0000, ensuring no fraudulent transactions were missed.
Specificity: 0.9822, indicating a high rate of correctly identifying legitimate transactions.
Confusion Matrix: 114,659 non-fraudulent and 166 fraudulent transactions correctly classified, with minimal misclassifications (2 false negatives and 1 false positive).
Kappa: 0.9881, signifying a strong agreement between predictions and actual outcomes.
P-Value: <0.0000000000000002, showing statistically significant performance.
McNemar’s Test P-Value: 0.6171, indicating no significant difference in error rates between fraud and non-fraud classifications.
XGBoost Model:
Accuracy: 1.0 (95% CI: 0.9999–1)
Sensitivity: 1.0000, ensuring that all fraudulent transactions were identified.
Specificity: 0.9763, demonstrating strong precision in classifying non-fraudulent transactions.
Confusion Matrix: 114,660 non-fraudulent and 165 fraudulent transactions correctly classified, with 4 false negatives and no false positives.
Kappa: 0.988, showing excellent agreement between predictions and actual outcomes. P-Value: <0.0000000000000002, indicating statistically significant performance.
McNemar’s Test P-Value: 0.1336, indicating no significant difference in error rates between the two classes.
Model Comparison:
Both models had almost identical accuracy scores, 0.9999652, and perfect sensitivity, 1.0000, meaning that no fraudulent transactions were missed.
The Random Forest model had a slightly higher specificity of 0.9822, while the XGBoost model had a slightly lower specificity of 0.9763 but perfectly classified fraudulent transactions with no false positives. The performance of both models was rather outstanding, though XGBoost was a little bit better with sensitivity and Random Forest was a little bit better regarding specificity.
Evaluation of Models
Lasso Regression:
Optimal Lambda: 28,366.47 Model Evaluation: RMSE: 58,577.56 MAE: 36,915.42 R-squared: 0.998981
Lasso Regression performed very well, with a high R-squared value, hence a good fit to the data. However, there was slight underestimation of the predicted values, which is evident from the scatter plot, where most of the points lie below the diagonal reference line. However, this model provides a very good insight into the important factors that drive the amount in fraudulent transactions.
Linear Regression:
Model Evaluation: RMSE: 3,807.065 R-squared: 0.9999957
Linear Regression performed very well. With the R-squared value so close to its maximum, it indeed fitted well for both forecasted and actual values. For this value of RMSE, the real results have minor dispersion in their estimations; because of this fact, most scatter points fall close to an ideal reference line. With minimal underestimation, this was one of the powerful forecast models for the target variable.
Polynomial Regression:
Model Evaluation: RMSE: 3,903.402 R-squared: 0.9999955
Polynomial Regression also had a really great performance; its R-squared value is close to Linear Regression. Looking from the scatter plot, there is slight underestimation that may indicate that indeed, Polynomial Regression was as effective, but with marginally more error from Linear Regression. However, it captured the complicated relationship between the factors and provided the necessary insights in the data.
Feature Importance Analysis
The feature importance across the different models, namely, Lasso Regression, Polynomial Regression, and Linear Regression, are giving the important features that mostly influence the predictions of fraudulent transaction amount. Key takeaways for the above Feature Importance plots for different models are as follows:
Lasso Regression: oldbalanceOrg and newbalanceOrig are the features with the maximum importance values; newbalanceDest, newbalanceOrig_sq, and type_CASH_OUT are features for which the feature importance values turn out to be zero, showing their minimal contribution toward the model in question.
Polynomial Regression: This model underlined two most influential features, oldbalanceOrg and newbalanceOrig, although several other features were given lower yet important values, such as newbalanceDest and newbalanceOrig_sq, while type_CASH_OUT also had a moderate value of importance - not reflected from the LASSO model.
Linear Regression: As in Polynomial Regression, Linear Regression granted a high feature importance to oldbalanceOrg and newbalanceOrig, supplementing with smaller contributions with features such as newbalanceDest and type_CASH_OUT.
Objective 1:
The key objective of this analysis was to predict the amount of a fraudulent transaction based on several features of the transaction, and both Random Forest and XGBoost models were highly successful in this. Both models showed great results with nearly identical accuracy and sensitivity scores, hence fraud transactions are surely detected.
The Random Forest model performed slightly better in terms of specificity, where it could correctly identify the non-fraudulent transactions, while XGBoost showed perfect sensitivity, thus detecting all fraudulent transactions with perfect precision. Both models performed remarkably well in predicting the fraudulent transaction amounts and hence minimized false positives and false negatives, making a reliable fraud detection system.
In all, the goal of estimating fraudulent transaction amounts was accomplished, and both models are strong tools in detecting fraud with high accuracy. This consolidates the fact that machine learning models, especially those like Random Forest and XGBoost, do actually fit well for fraud detection tasks and may be deployed in the real world for that purpose.
Objective 2:
The use of models such as Lasso Regression, Linear Regression, and Polynomial Regression successfully pinned the most influential features that were driving the amount of fraudulent transactions. Its consistency across different models places oldbalanceOrg and newbalanceOrig among the important features in order to predict how much fraudulent amount of transaction hence very important on fraud detection. First, these two features had the biggest importance values on all the models viewed, which indicated their major and significant role in influencing transaction amounts.
While other features like newbalanceDest, newbalanceOrig_sq, and type_CASH_OUT were important to various extents in the models, they did not bear as much influence as oldbalanceOrg and newbalanceOrig. Specifically, Lasso Regression assigned zero importance to many features, while Linear and Polynomial Regression models captured a broader set of relevant features.
These critical factors were identified by understanding the importance of each feature within the prediction models, along with performance evaluation metrics. The R-squared values being close to 1 in both Linear Regression and Polynomial Regression revealed a very strong fit between the predicted and actual values, confirming that the models have effectively captured the underlying patterns in fraudulent transaction amounts. Second, it seems that all models slightly underestimate making a prediction; although the models are very precise, further adjustments of the models will probably help improve such predictions.
Therefore, the analysis achieved its objective in identifying major factors driving fraudulent transaction amounts and pointed out oldbalanceOrg and newbalanceOrig as most important. The models provided rich insights into what drives fraud and can be useful in further steps of fraud detection and prevention.