Detecting Online Payment Fraud Using Machine Learning Models

Lecturer: Dr. Ang

Group 9

Matric Full Name
23121328 Mohammed Iqram
24052516 LI JUNMING
22106713 LI YUEXIN
23108677 ZHAO ZITONG
23111676 LIU YICONG

Introduction

Fraud detection is an essential field of study due to the increasing prevalence of online transactions and digital payment systems. With billions of dollars lost annually to fraudulent activities, detecting and preventing fraud has become a top priority for organizations to protect financial resources and maintain customer trust. Fraudulent transactions often constitute a small portion of all activities, making detection a challenging task due to data imbalance and the constantly evolving nature of fraudulent patterns.

This study focuses on understanding the key features that contribute to the likelihood of fraud and developing a predictive model to assign risk scores to transactions. By identifying significant features and predicting the likelihood of fraud, businesses can prioritize high-risk cases for investigation, reducing false positives and improving detection efficiency.

Questions:

  1. Can a transaction be accurately classified as fraudulent or non-fraudulent using machine learning models? (Classification)
  2. Which features contribute most significantly to the likelihood of fraud, and can a risk score be predicted from them? (Regression)

Objectives:

  1. To accurately classify a transaction as fraudulent or non-fraudulent using machine learning models.
  2. To determine which features contribute most significantly to the likelihood of fraud.

1.0 Details of the Raw Data
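
The code in this report relies on several R packages that must be loaded first. The following setup block is a minimal sketch inferred from the functions used throughout (it does not appear in the original analysis); the package-to-function mapping is noted in the comments.

library(dplyr)        # %>%, select(), mutate(), case_when()
library(tidyr)        # pivot_longer()
library(ggplot2)      # all ggplot() figures
library(fastDummies)  # dummy_cols() for one-hot encoding
library(digest)       # digest() for hashing encoding
library(corrplot)     # corrplot() correlation heatmap
library(caret)        # createDataPartition(), train(), confusionMatrix()
library(randomForest) # backend for caret's method = "rf"
library(pROC)         # roc(), auc()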

rawdata <- read.csv('dirty_data.csv')
head(rawdata) # check first six rows of rawdata
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1              0              0       0              0
## 2              0              0       0              0
## 3              0              0       1              0
## 4          21182              0       1              0
## 5              0              0       0              0
## 6              0              0       0              0

Basic information about the dataset

Title: onlinefraud.csv

Year: 2022

Purpose of dataset: Online payment fraud detection by machine learning

Dimension:
Number of rows: 1,048,575
Number of columns: 11

Data size: 493.53 MB

Meaning of Column Names:

Column Name      Description
step             Unit of time, where 1 step equals 1 hour
type             Type of online transaction
amount           Amount of the transaction
nameOrig         Customer initiating the transaction
oldbalanceOrg    Sender's balance before the transaction
newbalanceOrig   Sender's balance after the transaction
nameDest         Recipient of the transaction
oldbalanceDest   Recipient's balance before the transaction
newbalanceDest   Recipient's balance after the transaction
isFraud          Indicates whether the transaction is fraudulent (1 = Fraud)
isFlaggedFraud   System flag for suspected fraud (constant 0 in this dataset)
# Check the dimension of the dataset
dim(rawdata)
## [1] 1048575      11
# Structure of the dataset
str(rawdata)
## 'data.frame':    1048575 obs. of  11 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ isFlaggedFraud: int  0 0 0 0 0 0 0 0 0 0 ...
# Summary of the dataset
summary(rawdata)
##       step           type               amount           nameOrig        
##  Min.   : 1.00   Length:1048575     Min.   :       0   Length:1048575    
##  1st Qu.:15.00   Class :character   1st Qu.:   12149   Class :character  
##  Median :20.00   Mode  :character   Median :   76343   Mode  :character  
##  Mean   :26.97                      Mean   :  158667                     
##  3rd Qu.:39.00                      3rd Qu.:  213762                     
##  Max.   :95.00                      Max.   :10000000                     
##                                     NA's   :6                            
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##                                                                             
##  newbalanceDest        isFraud         isFlaggedFraud
##  Min.   :       0   Min.   :0.000000   Min.   :0     
##  1st Qu.:       0   1st Qu.:0.000000   1st Qu.:0     
##  Median :  218260   Median :0.000000   Median :0     
##  Mean   : 1114198   Mean   :0.001089   Mean   :0     
##  3rd Qu.: 1149808   3rd Qu.:0.000000   3rd Qu.:0     
##  Max.   :42200000   Max.   :1.000000   Max.   :0     
## 

2.0 Data Cleaning

data <- rawdata # for data cleaning purpose

2.1 Drop Uninformative Column

The isFlaggedFraud column is constant (the summary above shows it is 0 for every row), so it carries no information for modeling and is dropped.

data <- rawdata %>% select(-"isFlaggedFraud")  # drop the constant isFlaggedFraud column
head(data)
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud
## 1              0              0       0
## 2              0              0       0
## 3              0              0       1
## 4          21182              0       1
## 5              0              0       0
## 6              0              0       0

2.2 Check for Null Values

colSums(is.na(data))
##           step           type         amount       nameOrig  oldbalanceOrg 
##              0              0              6              0              0 
## newbalanceOrig       nameDest oldbalanceDest newbalanceDest        isFraud 
##              0              0              0              0              0

The summary shows that the amount column has 6 missing (NA) values.

2.3 Fill the Missing Values

na_amount <- is.na(data$amount)  # compute the NA mask once, then impute
data$amount[na_amount] <- data$oldbalanceOrg[na_amount] - data$newbalanceOrig[na_amount]

The missing values are imputed logically as amount = oldbalanceOrg - newbalanceOrig.

sum(is.na(data))
## [1] 0

There are no remaining missing values.

2.4 Check for Duplicates

sum(duplicated(data))
## [1] 0

No duplicates found

2.5 Data Type Validation

str(data)
## 'data.frame':    1048575 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
data$type <- as.factor(data$type)
data$isFraud <- as.logical(data$isFraud)
data$amount <- as.numeric(data$amount)
data$oldbalanceOrg <- as.numeric(data$oldbalanceOrg)
data$newbalanceOrig <- as.numeric(data$newbalanceOrig)
data$oldbalanceDest <- as.numeric(data$oldbalanceDest)
data$newbalanceDest <- as.numeric(data$newbalanceDest)
summary(data)
##       step             type            amount           nameOrig        
##  Min.   : 1.00   CASH_IN :227130   Min.   : -227335   Length:1048575    
##  1st Qu.:15.00   CASH_OUT:373641   1st Qu.:   12149   Class :character  
##  Median :20.00   DEBIT   :  7178   Median :   76342   Mode  :character  
##  Mean   :26.97   PAYMENT :353873   Mean   :  158666                     
##  3rd Qu.:39.00   TRANSFER: 86753   3rd Qu.:  213762                     
##  Max.   :95.00                     Max.   :10000000                     
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##  newbalanceDest      isFraud       
##  Min.   :       0   Mode :logical  
##  1st Qu.:       0   FALSE:1047433  
##  Median :  218260   TRUE :1142     
##  Mean   : 1114198                  
##  3rd Qu.: 1149808                  
##  Max.   :42200000

2.6 Check for Outliers

boxplot(data$amount, main = "Boxplot for Amount", ylab = "Amount")

boxplot(data$oldbalanceOrg, main = "Boxplot for Old Balance Origin", ylab = "Old Balance Origin")

boxplot(data$newbalanceOrig, main = "Boxplot for New Balance Origin", ylab = "New Balance Origin")

boxplot(data$oldbalanceDest, main = "Boxplot for Old Balance Des", ylab = "Old Balance Des")

boxplot(data$newbalanceDest, main = "Boxplot for New Balance Des", ylab = "New Balance Des")

We do not remove these outliers: they represent valid financial transactions, and removing them would discard valuable information and potentially bias the results.
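
To quantify rather than remove the outliers, a simple IQR-based count can complement the boxplots. This is an illustrative sketch, not part of the original analysis:

count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)  # 1.5 * IQR rule
}
sapply(data[c("amount", "oldbalanceOrg", "newbalanceOrig",
              "oldbalanceDest", "newbalanceDest")], count_outliers)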

2.7 Ensure Balance Consistency

# Identify inconsistent rows
inconsistent_rows <- data[data$newbalanceOrig > data$oldbalanceOrg |
                          data$newbalanceDest > data$oldbalanceDest, ]
cat("Inconsistent rows:", nrow(inconsistent_rows), "\n")
## Inconsistent rows: 661442
# Keep only consistent rows
data <- data[data$newbalanceOrig <= data$oldbalanceOrg &
             data$newbalanceDest <= data$oldbalanceDest, ]
cat(all(data$newbalanceOrig <= data$oldbalanceOrg), "\n")
## TRUE
cat(all(data$newbalanceDest <= data$oldbalanceDest), "\n")
## TRUE
str(data)
## 'data.frame':    387133 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...
# Consistent rows remaining: 387,133

2.8 Categorical Data Validation

unique(data$type)
## [1] PAYMENT  TRANSFER CASH_OUT DEBIT    CASH_IN 
## Levels: CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
# unique(data$nameOrig)
# unique(data$nameDest)

Drop rows with type "CASH_IN", which is not an expected transaction type for detecting fraud:

data <- data[data$type != "CASH_IN", ]
cat("Frequency table for 'type':\n")
## Frequency table for 'type':
print(table(data$type))
## 
##  CASH_IN CASH_OUT    DEBIT  PAYMENT TRANSFER 
##        0    24731     1036   353873     3126
str(data)
## 'data.frame':    382766 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...

After dropping "CASH_IN", the dataset shrinks from 387,133 to 382,766 observations.
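
Note that the empty CASH_IN factor level is retained after filtering (the frequency table above still lists CASH_IN with count 0); it later becomes an all-zero type_CASH_IN dummy column and triggers a zero-standard-deviation warning in the correlation analysis. A one-line fix, not applied in the original run, would be:

data$type <- droplevels(data$type)  # drop the now-empty CASH_IN level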

2.9 One-Hot Encoding for Categorical Variables (type)

data <- dummy_cols(
  data,
  select_columns = c("type"),
  remove_selected_columns = TRUE
)

Hashing encoding is applied to the high-cardinality categorical variables (nameOrig, nameDest) to convert them into numeric format.

data$nameOrig <- as.integer(sapply(data$nameOrig, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000 #num_buckets could change based on model
}))
data$nameDest <- as.integer(sapply(data$nameDest, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000
}))
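
Hashing into 1,000 buckets is lossy by design: distinct account IDs can collide in the same bucket. A quick sanity check on the encoded columns (illustrative, not part of the original run):

stopifnot(all(data$nameOrig >= 0 & data$nameOrig <= 999))  # buckets lie in [0, 999]
length(unique(data$nameOrig))  # number of distinct buckets, at most 1,000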

2.10 Feature Engineering

# Convert steps into days and create bins for different time periods
data <- data %>%
  mutate(day = step %/% 24 + 1 )%>%
  mutate(period = case_when(
    step %% 24 >= 0 & step %% 24 < 6 ~ "Night",
    step %% 24 >= 6 & step %% 24 < 12 ~ "Morning",
    step %% 24 >= 12 & step %% 24 < 18 ~ "Afternoon",
    TRUE ~ "Evening"
  ))

# Calculate ratio of transaction amount to the old balance and balance change
# For the originating account
data <- data %>%
  mutate(ratio_orig = ifelse(oldbalanceOrg > 0, amount / oldbalanceOrg, 0))%>%
  mutate(change_orig = oldbalanceOrg - newbalanceOrig)

# For the destination account
data <- data %>%
  mutate(ratio_dest = ifelse(oldbalanceDest > 0, amount / oldbalanceDest, 0)) %>%
  mutate(change_dest = oldbalanceDest - newbalanceDest)

# Difference between origin and destination balance change
data <- data %>%
  mutate(change_diff = change_orig - change_dest)

3.0 Exploratory Data Analysis (EDA)

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

3.1 The distribution of isFraud

Plot of Transaction Types by Fraud Status

options(scipen = 999)
plot_data <- data %>%
  group_by(isFraud) %>%
  summarise(
    CASH_OUT = sum(type_CASH_OUT),
    DEBIT = sum(type_DEBIT),
    PAYMENT = sum(type_PAYMENT),
    TRANSFER = sum(type_TRANSFER)
  )


plot_data_long <- plot_data %>%
  pivot_longer(
    cols = c(CASH_OUT, DEBIT, PAYMENT, TRANSFER),
    names_to = "Transaction_Type",
    values_to = "Value"
  )


plot_data_long$isFraud <- as.factor(plot_data_long$isFraud)


ggplot(plot_data_long, aes(x = Transaction_Type, y = Value, fill = isFraud)) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8), color = "black") +
  geom_text(aes(label = Value),
            vjust = -0.5,
            size = 3,
            position = position_dodge(width = 0.8),
            color = "black") +
  scale_y_continuous(trans = "identity") +
  labs(
    title = "Count Plot of Transaction Types by Fraud Status",
    x = "Transaction Type",
    y = "Number of Transactions",
    fill = "Fraud Status"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "right"
  ) +
  scale_fill_manual(
    values = c("FALSE" = "skyblue", "TRUE" = "#FF6666"),
    labels = c("Non-Fraudulent", "Fraudulent")
  )

3.2 Numerical Feature Distribution

Histograms for numerical columns

cols_of_interest <- c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")

data_filtered <- data[cols_of_interest]

for (col_name in cols_of_interest) {

  log_col_name <- paste0("Log_", col_name)
  data_filtered[[log_col_name]] <- log(data_filtered[[col_name]] + 1)

  p <- ggplot(data_filtered, aes(x = .data[[log_col_name]])) +
    geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
    labs(title = paste("Histogram of Log-transformed", col_name),
         x = paste("Log(", col_name, " + 1)", sep = ""),
         y = "Frequency") +
    theme_minimal()

  print(p)
}

3.3 Correlation Analysis

#correlation matrix using hierarchical clustering

num_cols <- sapply(data, is.numeric)
cor_matrix <- cor(data[, num_cols], use = "complete.obs")
## Warning in cor(data[, num_cols], use = "complete.obs"): the standard deviation
## is zero
cor_matrix <- cor_matrix[rownames(cor_matrix) != "type_CASH_IN", colnames(cor_matrix) != "type_CASH_IN"]

dist_matrix <- as.dist(1 - cor_matrix)

if (any(is.na(dist_matrix))) {
    stop("Distance matrix contains NA values.")
}
if (any(dist_matrix < 0, na.rm = TRUE)) {
    stop("Distance matrix contains negative values.")
}

hclust_order <- hclust(dist_matrix)$order

cor_matrix <- cor_matrix[hclust_order, hclust_order]


corrplot(
    cor_matrix,
    method = "circle",
    type = "full",
    tl.cex = 0.8,
    cl.cex = 0.8,
    tl.col = "black",
    addrect = 2
)
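
The "standard deviation is zero" warning above is caused by constant columns (notably the all-zero type_CASH_IN dummy), which is why that column is removed by name afterwards. A more general alternative, sketched here but not run above, is to filter zero-variance columns before calling cor():

num_data <- data[, sapply(data, is.numeric)]
keep <- sapply(num_data, function(col) sd(col, na.rm = TRUE) > 0)  # drop constant columns
cor_matrix <- cor(num_data[, keep], use = "complete.obs")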

3.4 Balance Analysis

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

Scatter Plots

We investigate the consistency of balances. The first scatter plot shows the relationship between the sender's balance before and after the transaction, which helps detect unusual or unexpected patterns such as abrupt zero balances or discrepancies. The second scatter plot examines how transactions affect the recipient's balance and helps spot anomalies where balances remain unchanged (a potential red flag for fraud or simulated transactions).

# Scatter plot for oldbalanceOrg vs newbalanceOrig
plotorg <- ggplot(data, aes(x = oldbalanceOrg, y = newbalanceOrig)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Scatter Plot: oldbalanceOrg vs newbalanceOrig",
       x = "Old Balance Origin",
       y = "New Balance Origin") +
  theme_minimal()

print(plotorg)

# Scatter plot for oldbalanceDest vs newbalanceDest
plotdest <- ggplot(data, aes(x = oldbalanceDest, y = newbalanceDest)) +
  geom_point(alpha = 0.5, color = "lightgreen") +
  labs(title = "Scatter Plot: oldbalanceDest vs newbalanceDest",
       x = "Initial balance of recipient before the transaction",
       y = "New balance of recipient after the transaction") +
  theme_minimal()

# Print the second plot
print(plotdest)

3.5 Fraud Pattern Detection

We investigate fraud patterns by comparing transaction amounts and balance changes for fraudulent and non-fraudulent transactions.

# Violin plot for Transaction Amount by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = amount, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  scale_y_log10() + # Use log scale for better visualization if amounts vary widely
  labs(title = "Transaction Amount Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Transaction Amount (log scale)") +
  theme_minimal() +
  theme(legend.position = "none")

This plot shows the distribution of transaction amounts for fraudulent and non-fraudulent transactions. The width of each violin indicates the density (frequency) of transactions at each amount level, and a log scale is used to accommodate the wide range of amounts. If the fraudulent distribution is skewed towards higher amounts than the non-fraudulent one, it suggests that fraud often involves larger sums of money; peaks (wider sections) indicate common transaction amounts. Narrow tails in the non-fraudulent category point to a consistent range of typical amounts, while broader tails for fraud highlight its unpredictability.

# Violin plot for Balance Changes (oldbalanceOrg - newbalanceOrig) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = oldbalanceOrg - newbalanceOrig, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Origin Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (oldbalanceOrg - newbalanceOrig)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with most balance changes concentrated near small values (close to 0). This indicates that for non-fraudulent transactions, the sender’s balance changes are typically modest and predictable. The balance change pattern is consistent with legitimate transactions, where the deducted amount aligns with normal activities.

  • Fraudulent Transactions (TRUE)

    The distribution is much broader, and balance changes can be extremely large, reaching values in the millions. This suggests that fraudulent transactions often involve significant deductions from the sender’s account. The sharp contrast in balance change magnitude highlights that fraud frequently involves outliers (large, irregular transactions).

# Violin plot for Balance Changes (newbalanceDest - oldbalanceDest) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = newbalanceDest - oldbalanceDest, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Destination Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (newbalanceDest - oldbalanceDest)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with balance changes clustered near zero. This indicates that in most legitimate transactions, recipients’ balances either increase modestly or stay consistent with expected transaction behavior. The low variability suggests predictable and consistent crediting in legitimate transactions.

  • Fraudulent Transactions (TRUE)

    The distribution for fraudulent transactions also exhibits a narrow range, but with significant outliers. These outliers indicate large negative balance changes, where credits do not align with the amounts deducted from the sender, and inconsistent or unusual behavior, possibly due to incomplete or simulated transactions.

4.0 Modeling

4.1 Classification Model: Random Forest

# Split data sets
data_rf <- data
data_rf$isFraud <- as.factor(data_rf$isFraud)
levels(data_rf$isFraud) <- c("NonFraud", "Fraud") 
set.seed(123)
train_index <- createDataPartition(data_rf$isFraud, p = 0.7, list = FALSE)
train_data <- data_rf[train_index, ]
test_data <- data_rf[-train_index, ]
#Building model 


n_features <- ncol(data_rf) - 1
# Setting up cross-validation controls for random searches
train_control_random <- trainControl(
  method = "cv",        
  number = 5,           
  classProbs = TRUE,    
  summaryFunction = twoClassSummary 
)
# Random search
set.seed(123)
rf_random_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneLength = 10, 
  trControl = train_control_random,
  ntree = 100     
)

print(rf_random_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9999890  0.9999925  0.9417722
##    4    0.9999973  0.9999888  0.9544304
##    6    0.9999956  0.9999888  0.9518987
##    8    0.9999950  0.9999850  0.9518987
##   10    0.9999939  0.9999888  0.9518987
##   13    0.9999717  0.9999888  0.9518987
##   15    0.9949039  0.9999888  0.9518987
##   17    0.9949044  0.9999888  0.9518987
##   19    0.9873092  0.9999888  0.9518987
##   22    0.9885763  0.9999888  0.9518987
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
best_mtry <- rf_random_search$bestTune$mtry
cat("Best mtry from Random Search:", best_mtry, "\n")
## Best mtry from Random Search: 4

Due to the large amount of data, we first used random search to narrow down the tuning parameters; a grid search was then used to pin down the final values. This reduces runtime and improves efficiency.

# Grid Search
# Narrow down mtry based on random search results
rf_grid <- expand.grid(
  mtry = seq(max(1, best_mtry - 2), best_mtry + 2, by = 1) #Make sure mtry is not less than 1
)

# Setting up cross-validation controls for grid searches
train_control_grid <- trainControl(
  method = "cv",        # cross-validation
  number = 5,           # 5-fold 
  classProbs = TRUE,    
  summaryFunction = twoClassSummary # Use of ROC as an assessment indicator
)

# Grid Search
set.seed(123)
rf_grid_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneGrid = rf_grid,
  trControl = train_control_grid,
  ntree = 100
)


print(rf_grid_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   2     0.9999898  0.9999925  0.9518987
##   3     0.9999943  0.9999888  0.9518987
##   4     0.9999965  0.9999888  0.9518987
##   5     0.9999958  0.9999888  0.9544304
##   6     0.9999974  0.9999850  0.9569620
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
cat("Best mtry from Grid Search:", rf_grid_search$bestTune$mtry, "\n")
## Best mtry from Grid Search: 6
# Model Evaluation
# Evaluate the final model on the test set
final_model <- rf_grid_search$finalModel
test_predictions <- predict(rf_grid_search, newdata = test_data, type = "prob")[, 2]

# Calculate AUC
test_roc <- roc(test_data$isFraud, test_predictions)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
cat("Test AUC:", auc(test_roc), "\n")
## Test AUC: 0.9970317
# Plotting ROC curves
plot(test_roc, main = "ROC Curve for Final Model")

As the ROC curve shows, the model's classification performance is very strong. The curve hugs the upper-left corner, indicating that the model distinguishes positive (fraudulent) from negative (non-fraudulent) samples with high sensitivity and specificity, and the area under the curve (AUC) is close to 1, confirming near-optimal classification across thresholds. However, such near-perfect performance warrants vigilance against overfitting, especially if the test data are too similar to the training distribution or there is data leakage. A combination of confusion matrices, classification metrics (e.g., precision and recall), and further examination of the data distribution is needed to ensure the robustness and generalisation ability of the model.

# Confusion Matrix
test_predictions_class <- predict(rf_grid_search, newdata = test_data, type = "raw")
confusion_matrix <- confusionMatrix(test_predictions_class, test_data$isFraud)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NonFraud  Fraud
##   NonFraud   114659      2
##   Fraud           1    167
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.9999, 1)        
##     No Information Rate : 0.9985             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9911             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.9882             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9940             
##              Prevalence : 0.9985             
##          Detection Rate : 0.9985             
##    Detection Prevalence : 0.9985             
##       Balanced Accuracy : 0.9941             
##                                              
##        'Positive' Class : NonFraud           
## 

From the confusion matrix and statistical metrics, the Random Forest model performed extremely well in the classification task. It made very few misclassifications on the test set: only 1 NonFraud transaction was misclassified as Fraud, and 2 Fraud transactions were misclassified as NonFraud. The overall Accuracy is 1 (rounded), with an almost perfect 95% confidence interval of (0.9999, 1). Note that the 'positive' class here is NonFraud: the Sensitivity of 1.0000 means essentially all legitimate transactions are correctly identified, while the Specificity of 0.9882 means 167 of 169 fraudulent transactions are caught. The Positive Predictive Value of 1.0000 indicates that samples predicted as NonFraud are virtually all truly NonFraud, and the Negative Predictive Value of 0.9940 indicates that most samples predicted as Fraud are indeed Fraud.

4.2 Regression Model: Logistic Regression

# Prepare data
data_logit <- data
data_logit$isFraud <- as.factor(data_logit$isFraud)
levels(data_logit$isFraud) <- c("NonFraud", "Fraud")

# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(data_logit$isFraud, p = 0.7, list = FALSE)
train_data <- data_logit[train_index, ]
test_data <- data_logit[-train_index, ]

# Build logistic regression model
logit_model <- glm(isFraud ~ ., data = train_data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Display model summary
summary(logit_model)
## 
## Call:
## glm(formula = isFraud ~ ., family = binomial, data = train_data)
## 
## Coefficients: (4 not defined because of singularities)
##                                      Estimate                     Std. Error
## (Intercept)    -3022716009497732025600602.000          46921597383083344.000
## step                        4831344941913.332                      84405.344
## amount                        -1691543905.232                          3.742
## nameOrig                      11402541305.417                        448.652
## oldbalanceOrg                  3188407203.092                          3.872
## newbalanceOrig                -6547275677.137                          3.954
## nameDest                      -3267480095.920                        449.323
## oldbalanceDest                -1627130229.186                          2.150
## newbalanceDest                 1450415670.405                          2.149
## type_CASH_IN                               NA                             NA
## type_CASH_OUT   3022716009169918646824402.000          46921597382954680.000
## type_DEBIT      3022716008554150259400446.000          46921597382956824.000
## type_PAYMENT    3022716009263590273064066.000          46921597382957352.000
## type_TRANSFER   3022716009438239212484244.000          46921597382918688.000
## day                       -82512311989066.547                    2066016.713
## periodEvening              12482900810054.064                     546064.555
## periodMorning              49118748561660.062                     533547.603
## periodNight               141171689431696.531                    1425457.084
## ratio_orig                    -4433106998.311                        177.209
## change_orig                                NA                             NA
## ratio_dest                  -107865347518.556                        774.756
## change_dest                                NA                             NA
## change_diff                                NA                             NA
##                    z value            Pr(>|z|)    
## (Intercept)      -64420569 <0.0000000000000002 ***
## step              57239799 <0.0000000000000002 ***
## amount          -452035254 <0.0000000000000002 ***
## nameOrig          25415114 <0.0000000000000002 ***
## oldbalanceOrg    823467316 <0.0000000000000002 ***
## newbalanceOrig -1656025867 <0.0000000000000002 ***
## nameDest          -7272014 <0.0000000000000002 ***
## oldbalanceDest  -756909887 <0.0000000000000002 ***
## newbalanceDest   674897100 <0.0000000000000002 ***
## type_CASH_IN            NA                  NA    
## type_CASH_OUT     64420569 <0.0000000000000002 ***
## type_DEBIT        64420569 <0.0000000000000002 ***
## type_PAYMENT      64420569 <0.0000000000000002 ***
## type_TRANSFER     64420569 <0.0000000000000002 ***
## day              -39937872 <0.0000000000000002 ***
## periodEvening     22859753 <0.0000000000000002 ***
## periodMorning     92060668 <0.0000000000000002 ***
## periodNight       99036085 <0.0000000000000002 ***
## ratio_orig       -25016331 <0.0000000000000002 ***
## change_orig             NA                  NA    
## ratio_dest      -139224981 <0.0000000000000002 ***
## change_dest             NA                  NA    
## change_diff             NA                  NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5939.9  on 267936  degrees of freedom
## Residual deviance: 6127.4  on 267918  degrees of freedom
## AIC: 6165.4
## 
## Number of Fisher Scoring iterations: 25
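
The convergence warnings and the astronomically large coefficients above are classic symptoms of (quasi-)complete separation and collinearity (note also the four coefficients dropped as NA for singularities), so the individual magnitudes and z-values should be interpreted with caution. A standard remedy, sketched below assuming the glmnet package is available (this was not run in the original analysis), is ridge-penalized logistic regression, which shrinks the coefficients to stable values:

library(glmnet)
x_train <- model.matrix(isFraud ~ ., data = train_data)[, -1]  # design matrix, intercept column dropped
y_train <- train_data$isFraud
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0)  # alpha = 0 -> ridge
coef(cv_fit, s = "lambda.min")  # shrunken, finite coefficients
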
# Make predictions on the test set
test_data$predicted_prob <- predict(logit_model, newdata = test_data, type = "response")

# Set threshold and classify predictions
threshold <- 0.5
test_data$predicted_class <- ifelse(test_data$predicted_prob > threshold, "Fraud", "NonFraud")

# Generate confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_data$predicted_class), as.factor(test_data$isFraud))
## Warning in confusionMatrix.default(as.factor(test_data$predicted_class), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
test_data$predicted_class <- factor(test_data$predicted_class, levels = levels(data_logit$isFraud))

conf_matrix <- confusionMatrix(
  test_data$predicted_class,
  test_data$isFraud
)  
conf_matrix_table <- as.table(conf_matrix)


conf_matrix_df <- as.data.frame(conf_matrix_table)

ggplot(conf_matrix_df, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "blue", name = "Frequency") +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  labs(
    title = "Confusion Matrix Visualization",
    x = "Predicted Class",
    y = "Actual Class"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5))

# Calculate evaluation metrics
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
misclassification_rate <- 1 - accuracy
specificity <- conf_matrix$byClass['Specificity']

# Display evaluation metrics in a table
metrics_df <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Misclassification Rate", "Specificity"),
  Value = c(accuracy, precision, recall, f1_score, misclassification_rate, specificity)
)

# Print evaluation metrics table
print(metrics_df)
##                   Metric        Value
## 1               Accuracy 0.9997735764
## 2              Precision 0.9998953409
## 3                 Recall 0.9998778999
## 4               F1 Score 0.9998866203
## 5 Misclassification Rate 0.0002264236
## 6            Specificity 0.9289940828
# McNemar's Test P-Value
mcnemar_test <- mcnemar.test(conf_matrix$table)
cat("McNemar's Test P-Value:", mcnemar_test$p.value, "\n")
## McNemar's Test P-Value: 0.8445193
# ROC Curve and AUC
roc_curve <- roc(test_data$isFraud, test_data$predicted_prob)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.964436
plot(roc_curve, main = "ROC Curve", col = "blue")
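
Given the heavy class imbalance, the fixed 0.5 cutoff used above is a convention rather than an optimum. As a brief sketch (not part of the original analysis), pROC can suggest the cutoff that maximizes Youden's J statistic:

best_cut <- coords(roc_curve, "best", ret = "threshold", best.method = "youden")
print(best_cut)  # candidate threshold balancing sensitivity and specificity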

feature_importance <- summary(logit_model)$coefficients

importance_df <- data.frame(
  Feature = rownames(feature_importance),
  Coefficient = feature_importance[, 1],
  PValue = feature_importance[, 4]
)

importance_df <- importance_df[!grepl("Intercept", importance_df$Feature), ]
importance_df <- na.omit(importance_df)



ggplot(importance_df, aes(x = reorder(Feature, Coefficient), y = Coefficient, fill = Coefficient > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("skyblue", "lightcoral")) +
  labs(title = "Feature Importance (Coefficients) from Logistic Regression",
       x = "Features",
       y = "Coefficient Value") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10),
        plot.title = element_text(hjust = 0.5))


feature_summary_df <- importance_df %>%
  mutate(Var1 = ifelse(Coefficient > 0, "Positive", "Negative")) %>%
  count(Var1) %>%
  rename(Freq = n) %>%
  mutate(Percentage = Freq / sum(Freq) * 100)

ggplot(feature_summary_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Distribution of Feature Coefficients (Positive vs Negative)", x = NULL, y = NULL) +
  theme_void() +
  scale_fill_manual(values = c("skyblue", "lightcoral")) +
  geom_text(aes(label = paste(Var1, "\n", round(Percentage, 1), "%")), position = position_stack(vjust = 0.5), size = 5, color = "white") +
  theme(plot.title = element_text(hjust = 0.5))

# Risk score and category assignment
test_data$risk_score <- test_data$predicted_prob
test_data$risk_category <- ifelse(test_data$predicted_prob > 0.7, "High Risk",
                                 ifelse(test_data$predicted_prob > 0.3, "Medium Risk", "Low Risk"))

# Display the first few rows of risk scores and categories
head(test_data[, c("predicted_prob", "risk_score", "risk_category")])
##              predicted_prob               risk_score risk_category
## 1  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 2  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 5  0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 16 0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 19 0.0000000000000002220446 0.0000000000000002220446      Low Risk
## 30 0.0000000000000002220446 0.0000000000000002220446      Low Risk
# Plot bar chart for risk categories
ggplot(test_data, aes(x = risk_category)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Distribution of Risk Categories", x = "Risk Category", y = "Count") +
  theme_minimal()
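
The nested ifelse() used above works, but the same three-way binning can be expressed more declaratively with cut(); this equivalent sketch yields identical categories:

test_data$risk_category <- cut(
  test_data$predicted_prob,
  breaks = c(-Inf, 0.3, 0.7, Inf),                    # (-Inf,0.3] (0.3,0.7] (0.7,Inf)
  labels = c("Low Risk", "Medium Risk", "High Risk")
)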

5.0 Result Discussion

5.1 Classification Model: Random Forest

The key objective was to build a strong machine learning model able to classify transactions as fraudulent or non-fraudulent with high accuracy. We implemented this with the Random Forest algorithm, chosen for its robustness to class imbalance and its ability to capture relationships among variables.

Optimization

Random Search for Hyperparameter Tuning:

  • The search was done over a range of mtry values from 2 to 22 to find the best parameter setting.
  • With mtry = 4, the model achieved the best performance and recorded an ROC score of 0.9999973.

Grid Search for Hyperparameter Tuning:

  • An exhaustive search was conducted over the following mtry values: 2, 3, 4, 5, and 6.
  • The best setting was found at mtry = 6, with a marginally better ROC score of 0.9999974. Grid search ensured that the parameter space around the random-search optimum was thoroughly explored and the model performance fine-tuned.

Evaluation Methods

We used 5-fold cross-validation to ensure that the model generalizes well and does not overfit: the data are divided into five parts, the model is trained on four parts and tested on the fifth, and this process is repeated five times with the results averaged.

We evaluated the performance of the model using the following metrics:

  1. ROC-AUC: How well the model separates fraud from non-fraud.
  2. Recall (Sensitivity): Whether the model misses any fraudulent cases.
  3. Specificity: How well valid transactions avoid being flagged as fraudulent.
  4. Accuracy: The overall proportion of correct predictions across fraud and non-fraud cases.
  5. Confusion Matrix: A breakdown of where the model was right and where it went wrong, i.e., frauds it missed and normal transactions it marked as fraudulent.
  6. Kappa Score: How well the model's predictions match the actual results beyond what chance alone would produce.

Results

Performance Metrics:

  • ROC-AUC: 0.997, indicating excellent discriminatory power.
  • Accuracy: 1.0 (95% CI: 0.9999–1).
  • Sensitivity: 1.0 for the positive (NonFraud) class, meaning essentially no legitimate transactions were misclassified.
  • Specificity: 0.9882, meaning 167 of 169 fraudulent transactions were correctly identified.
  • Kappa: 0.9911, reflecting strong agreement between predictions and actual outcomes.

Confusion Matrix:

  • Correctly classified: 114,659 NonFraud and 167 Fraud transactions.
  • Misclassifications: Only 3 in total (2 fraudulent transactions missed and 1 legitimate transaction flagged).

Balanced Accuracy: 0.9941, showing excellent performance across both classes despite dataset imbalance.

5.2 Regression Model: Logistic Regression

Optimization

Feature Engineering: Created additional risk categories (High Risk, Medium Risk, Low Risk) based on predicted probabilities.

Evaluation Methods

  • Accuracy: Overall correctness of the predictions.
  • Precision: Proportion of positive predictions that were correct.
  • Recall (Sensitivity): Proportion of true positive cases identified.
  • F1 Score: A single metric balancing precision and recall.
  • Specificity: How well the model classifies cases of the negative class.
  • AUC-ROC: Quantifies the model's discriminatory power.
  • McNemar's Test: Checks for significant differences between the two types of classification error.

Results:

  1. Performance Metrics:
  • Accuracy: 0.9998, reflecting near-perfect overall prediction.
  • Precision: 0.9999 (positive class = NonFraud), indicating that positive predictions are almost always correct.
  • Recall (Sensitivity): 0.9999, confirming that almost all NonFraud transactions are correctly identified.
  • F1 Score: 0.9999, balancing precision and recall effectively.
  • Specificity: 0.9290, suggesting good but not perfect detection of the fraud cases.
  • AUC-ROC: 0.9644, indicating high discriminatory power to separate fraud from non-fraud.

  2. Misclassification Rate:

Only 0.023% of transactions were misclassified, ensuring reliability.

Risk Categorization:

  • Transactions were categorized into High Risk, Medium Risk, and Low Risk based on predicted probabilities:
      - High Risk: predicted probability > 0.7
      - Medium Risk: 0.3 < predicted probability ≤ 0.7
      - Low Risk: predicted probability ≤ 0.3

McNemar’s Test P-Value: 0.84451, suggesting no significant difference in the errors of fraud and non-fraud classifications.

Feature Importance

Most Important Features:

type_TRANSFER, type_CASH_OUT, type_DEBIT, and type_PAYMENT are the most important features in fraud detection, as they carry the largest positive coefficients. (Given the convergence warnings noted in section 4.2, these coefficient magnitudes should be read as indicative rather than precise.)

6.0 Conclusion

This project set out to classify transactions as fraudulent or non-fraudulent and to identify the features contributing most to fraud risk scores. Our two machine learning models, Random Forest and Logistic Regression, achieved this twofold objective and demonstrated their efficiency in tackling the challenges of fraud detection.

1. Accurate Fraud Detection

The Random Forest model proved highly effective at classifying fraudulent transactions, even on an imbalanced dataset. Key results of this study are as follows:

Hyperparameter tuning via Random Search followed by Grid Search yielded the best-performing configuration. The final model posted excellent evaluation metrics: an ROC-AUC of 0.997, Sensitivity of 1.0 (for the NonFraud positive class), and a Balanced Accuracy of 0.9941, all indicating excellent separation between fraudulent and non-fraudulent transactions. This is further supported by the confusion matrix, which shows 99.997% of transactions correctly classified, with only three misclassifications (2 missed frauds and 1 false alarm). The model's robustness was checked with 5-fold cross-validation, confirming its generalization capability while keeping the overfitting risk low.

Impact: The Random Forest model identified fraudulent transactions with extraordinary precision and recall, ensuring minimal false positives and false negatives.

2. Feature Identification and Risk Prediction

Logistic Regression, on the other hand, was used to identify the most discriminative features contributing to fraud detection and to predict a risk score. Key outcomes include:

The predicted probabilities effectively stratified transactions into High, Medium, and Low risk levels, providing a clear risk assessment framework. The model's fraud detection performance was well balanced with minimal false positives, underlined by excellent precision, recall, and F1 scores of about 0.9999 each. Feature importance analysis showed that the transaction-type dummies (type_TRANSFER, type_CASH_OUT, type_DEBIT, type_PAYMENT) carried the largest positive coefficients and therefore the strongest implication in fraud prediction, although the convergence warnings mean these magnitudes should be interpreted cautiously. Model reliability was satisfactory, with a misclassification rate of only 0.023% across all transaction data. In addition, McNemar's Test returned a p-value of 0.84451, showing that the fraud and non-fraud error rates did not differ significantly, which further reinforces the model's trustworthiness.

Impact: Logistic Regression provided actionable insights into the most critical features driving fraudulent transactions, and its risk categorization framework lets stakeholders focus monitoring and intervention on transactions according to their risk level.

Achievement of the Goals in the Problem Statement

Accurate Fraud Classification:

The model, using a Random Forest classifier, classified fraudulent versus non-fraudulent transactions with high accuracy, sensitivity, and specificity.

Feature Identification and Risk Prediction:

Among others, type_TRANSFER, type_CASH_OUT, type_DEBIT, and type_PAYMENT were identified by Logistic Regression as the most important features, providing a robust framework for fraud risk stratification via predicted probabilities.