Detecting Online Payment Fraud Using Machine Learning Models

Lecturer: Dr. Ang Group 9

Matric Full Name
23121328 Mohammed Iqram
24052516 LI JUNMING
22106713 LI YUEXIN
23108677 ZHAO ZITONG
23111676 LIU YICONG

Project Background

Fraud detection has become highly important in today’s world, when most of us depend on digital payments and online transactions. Many people buy and sell many things in the virtual world, which increases the chances of fraud. Losses in the virtual world due to fraud now reach billions of dollars each year. This affects the bottom line of businesses and reduces customer confidence and trust. Hence, companies work hard and Pursue all possibilities to prevent fraud so that all types of service users remain safe.

There are significantly more genuine transactions than fraudulent ones, resulting in an imbalance that makes it difficult to identify the fraudulent transactions. This is one of the main issues with fraud detection. Additionally, fraudsters are always changing strategies, so systems must get smarter in order to recognize them. Therefore, it will be crucial to identify the patterns or indications that point to fraud and utilize machine learning to develop predictive systems that can aid in its detection and help in prevention.

This study aims to identify the underlying patterns of fraudulent activities and build a machine learning model that is capable of accurately classifying transactions as fraudulent or non-fraudulent.

Problem Statement:

  1. To accurately classify a transaction as fraudulent or non-fraudulent using machine learning models?

Proposed Model: Random Forest & XgBoost

  1. What are the key factors influencing the amount of fraudulent transactions?

Proposed Model: Lasso Regression, Linear Regression, and Polynomial Regression

1.0 The Details of Raw data

rawdata <- read.csv('dirty_data.csv') # here we are importing the dataset
head(rawdata) # here checking  first six rows of rawdata
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## 1              0              0       0              0
## 2              0              0       0              0
## 3              0              0       1              0
## 4          21182              0       1              0
## 5              0              0       0              0
## 6              0              0       0              0

Data Information

Year: 2022
Purpose of Dataset: Online payment fraud detection by machine learning
Dimension:
- Number of rows: 1,048,575
- Number of columns: 11

Data Size: 493.53 MB

Dataset Link: Online Payment Fraud Detection on Kaggle

Data Description:

Column Name Description
step Represents a unit of time where 1 step equals 1 hour
type Type of online transaction
amount The amount of the transaction
nameOrig Customer starting the transaction
oldbalanceOrg Balance before the transaction
newbalanceOrig Balance after the transaction
nameDest Recipient of the transaction
oldbalanceDest Initial balance of recipient before the transaction
newbalanceDest New balance of recipient after the transaction
isFraud Indicates whether the transaction is fraudulent (1 = Fraud)
# Checking the dimension of the dataset
dim(rawdata)
## [1] 1048575      11
# knowing the structure of the dataset
str(rawdata)
## 'data.frame':    1048575 obs. of  11 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ isFlaggedFraud: int  0 0 0 0 0 0 0 0 0 0 ...
# knowing the summary of the dataset
summary(rawdata)
##       step           type               amount           nameOrig        
##  Min.   : 1.00   Length:1048575     Min.   :       0   Length:1048575    
##  1st Qu.:15.00   Class :character   1st Qu.:   12149   Class :character  
##  Median :20.00   Mode  :character   Median :   76343   Mode  :character  
##  Mean   :26.97                      Mean   :  158667                     
##  3rd Qu.:39.00                      3rd Qu.:  213762                     
##  Max.   :95.00                      Max.   :10000000                     
##                                     NA's   :6                            
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##                                                                             
##  newbalanceDest        isFraud         isFlaggedFraud
##  Min.   :       0   Min.   :0.000000   Min.   :0     
##  1st Qu.:       0   1st Qu.:0.000000   1st Qu.:0     
##  Median :  218260   Median :0.000000   Median :0     
##  Mean   : 1114198   Mean   :0.001089   Mean   :0     
##  3rd Qu.: 1149808   3rd Qu.:0.000000   3rd Qu.:0     
##  Max.   :42200000   Max.   :1.000000   Max.   :0     
## 

2.0 Data Cleaning

data <- rawdata # for data cleaning purpose

2.1 Drop no meaning column

data <- rawdata %>% select(-"isFlaggedFraud")  # df <- df %>% select(-B)
head(data)
##   step     type   amount    nameOrig oldbalanceOrg newbalanceOrig    nameDest
## 1    1  PAYMENT  9839.64 C1231006815        170136      160296.36 M1979787155
## 2    1  PAYMENT  1864.28 C1666544295         21249       19384.72 M2044282225
## 3    1 TRANSFER   181.00 C1305486145           181           0.00  C553264065
## 4    1 CASH_OUT   181.00  C840083671           181           0.00   C38997010
## 5    1  PAYMENT 11668.14 C2048537720         41554       29885.86 M1230701703
## 6    1  PAYMENT  7817.71   C90045638         53860       46042.29  M573487274
##   oldbalanceDest newbalanceDest isFraud
## 1              0              0       0
## 2              0              0       0
## 3              0              0       1
## 4          21182              0       1
## 5              0              0       0
## 6              0              0       0

2.2 Check for Null Values

colSums(is.na(data))
##           step           type         amount       nameOrig  oldbalanceOrg 
##              0              0              6              0              0 
## newbalanceOrig       nameDest oldbalanceDest newbalanceDest        isFraud 
##              0              0              0              0              0

It shows that the Amount Column has 6 Null values

2.3 Fill the Blank Values

data$amount[is.na(data$amount)] <-
  data$oldbalanceOrg[is.na(data$amount)] - data$newbalanceOrig[is.na(data$amount)]

To fill missing values in the amount column, use the logical relationship:

amount=oldbalanceOrg−newbalanceOrig

If amount is missing, it can be computed by subtracting newbalanceOrig from oldbalanceOrg to maintain consistency in the data.

sum(is.na(data))
## [1] 0

There are no blank values

2.4 Check for Duplicates

sum(duplicated(data))
## [1] 0

No duplicates found

2.5 Data Type Validation

str(data)
## 'data.frame':    1048575 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : chr  "PAYMENT" "PAYMENT" "TRANSFER" "CASH_OUT" ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...

2.6 data type conversions

data$type <- as.factor(data$type)
data$isFraud <- as.logical(data$isFraud)
data$amount <- as.numeric(data$amount)
data$oldbalanceOrg <- as.numeric(data$oldbalanceOrg)
data$newbalanceOrig <- as.numeric(data$newbalanceOrig)
data$oldbalanceDest <- as.numeric(data$oldbalanceDest)
data$newbalanceDest <- as.numeric(data$newbalanceDest)
summary(data)
##       step             type            amount           nameOrig        
##  Min.   : 1.00   CASH_IN :227130   Min.   : -227335   Length:1048575    
##  1st Qu.:15.00   CASH_OUT:373641   1st Qu.:   12149   Class :character  
##  Median :20.00   DEBIT   :  7178   Median :   76342   Mode  :character  
##  Mean   :26.97   PAYMENT :353873   Mean   :  158666                     
##  3rd Qu.:39.00   TRANSFER: 86753   3rd Qu.:  213762                     
##  Max.   :95.00                     Max.   :10000000                     
##  oldbalanceOrg      newbalanceOrig       nameDest         oldbalanceDest    
##  Min.   :       0   Min.   :       0   Length:1048575     Min.   :       0  
##  1st Qu.:       0   1st Qu.:       0   Class :character   1st Qu.:       0  
##  Median :   16002   Median :       0   Mode  :character   Median :  126377  
##  Mean   :  874010   Mean   :  893809                      Mean   :  978160  
##  3rd Qu.:  136642   3rd Qu.:  174600                      3rd Qu.:  915923  
##  Max.   :38900000   Max.   :38900000                      Max.   :42100000  
##  newbalanceDest      isFraud       
##  Min.   :       0   Mode :logical  
##  1st Qu.:       0   FALSE:1047433  
##  Median :  218260   TRUE :1142     
##  Mean   : 1114198                  
##  3rd Qu.: 1149808                  
##  Max.   :42200000

2.7 Check of Outlier

boxplot(data$amount, main = "Boxplot for Amount", ylab = "Amount")

boxplot(data$oldbalanceOrg, main = "Boxplot for Old Balance Origin", ylab = "Old Balance Origin")

boxplot(data$newbalanceOrig, main = "Boxplot for New Balance Origin", ylab = "New Balance Origin")

boxplot(data$oldbalanceDest, main = "Boxplot for Old Balance Des", ylab = "Old Balance Des")

boxplot(data$newbalanceDest, main = "Boxplot for New Balance Des", ylab = "New Balance Des")

From the given boxplots, some variables like amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, and newbalanceDest still have outliers. I haven’t removed them because those are unusual but valid transactions. In financial data, especially fraud detection, outliers are quite common and can be for potentially fraudulent or high-value transactions. Therefore, retaining these outliers will result in better model accuracy and insight.

2.8 Ensure Balance Consistency

# Identify inconsistent rows
inconsistent_rows <- data[data$newbalanceOrig > data$oldbalanceOrg |
                          data$newbalanceDest > data$oldbalanceDest, ]
cat("Inconsistent rows:", nrow(inconsistent_rows), "\n")
## Inconsistent rows: 661442
# Keep only consistent rows
data <- data[data$newbalanceOrig <= data$oldbalanceOrg &
             data$newbalanceDest <= data$oldbalanceDest, ]
cat(all(data$newbalanceOrig <= data$oldbalanceOrg), "\n")
## TRUE
cat(all(data$newbalanceDest <= data$oldbalanceDest), "\n")
## TRUE
str(data)
## 'data.frame':    387133 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...
# Keep only consistent rows:38,7133

2.9 Categorical Data Validation

unique(data$type)
## [1] PAYMENT  TRANSFER CASH_OUT DEBIT    CASH_IN 
## Levels: CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
# unique(data$nameOrig)
# unique(data$nameDest)

Drop rows with type:“CASH_IN”, which is unexpected type to detect fraud

data <- data[data$type != "CASH_IN", ]
cat("Frequency table for 'type':\n")
## Frequency table for 'type':
print(table(data$type))
## 
##  CASH_IN CASH_OUT    DEBIT  PAYMENT TRANSFER 
##        0    24731     1036   353873     3126
str(data)
## 'data.frame':    382766 obs. of  10 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : chr  "C1231006815" "C1666544295" "C1305486145" "C840083671" ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : chr  "M1979787155" "M2044282225" "C553264065" "C38997010" ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : logi  FALSE FALSE TRUE TRUE FALSE FALSE ...

after droped “CASH_IN”: 38,7133 observations >> 38,2766 observations

colnames(data)
##  [1] "step"           "type"           "amount"         "nameOrig"      
##  [5] "oldbalanceOrg"  "newbalanceOrig" "nameDest"       "oldbalanceDest"
##  [9] "newbalanceDest" "isFraud"

2.10 One-Hot Encoding for categorical variables(type)

data <- dummy_cols(
  data,
  select_columns = c("type"),
  remove_selected_columns = TRUE
)

Hashing Encoding for categorical variables (nameOrig, nameDest) to convert them into numeric format.

data$nameOrig <- as.integer(sapply(data$nameOrig, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000 #num_buckets could change based on model
}))
data$nameDest <- as.integer(sapply(data$nameDest, function(x) {
  hash_value <- digest(x, algo = "xxhash32", seed = 123)
  as.numeric(paste0("0x", hash_value)) %% 1000
}))

2.11 Feature Engineering

# Convert steps into days and create bins for different time periods
data <- data %>%
  mutate(day = step %/% 24 + 1 )%>%
  mutate(period = case_when(
    step %% 24 >= 0 & step %% 24 < 6 ~ "Night",
    step %% 24 >= 6 & step %% 24 < 12 ~ "Morning",
    step %% 24 >= 12 & step %% 24 < 18 ~ "Afternoon",
    TRUE ~ "Evening"
  ))

# Calculate ratio of transaction amount to the old balance and balance change
# For the originating account
data <- data %>%
  mutate(ratio_orig = ifelse(oldbalanceOrg > 0, amount / oldbalanceOrg, 0))%>%
  mutate(change_orig = oldbalanceOrg - newbalanceOrig)

# For the destination account
data <- data %>%
  mutate(ratio_dest = ifelse(oldbalanceDest > 0, amount / oldbalanceDest, 0)) %>%
  mutate(change_dest = oldbalanceDest - newbalanceDest)

# Difference between origin and destination balance change
data <- data %>%
  mutate(change_diff = change_orig - change_dest)

we converted steps into days and created time period bins (Night, Morning, Afternoon, Evening) based on the step value. We then calculated the ratio of transaction amount to the old balance and balance change for both originating and destination accounts, followed by calculating the difference in balance changes between the origin and destination accounts.

3.0 Exploratory Data Analysis(EDA)

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

3.1 The distribution of isFraud

Plot of Transaction Types by Fraud Status

options(scipen = 999)
plot_data <- data %>%
  group_by(isFraud) %>%
  summarise(
    CASH_OUT = sum(type_CASH_OUT),
    DEBIT = sum(type_DEBIT),
    PAYMENT = sum(type_PAYMENT),
    TRANSFER = sum(type_TRANSFER)
  )


plot_data_long <- plot_data %>%
  pivot_longer(
    cols = c(CASH_OUT, DEBIT, PAYMENT, TRANSFER),
    names_to = "Transaction_Type",
    values_to = "Value"
  )


plot_data_long$isFraud <- as.factor(plot_data_long$isFraud)


ggplot(plot_data_long, aes(x = Transaction_Type, y = Value, fill = isFraud)) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8), color = "black") +
  geom_text(aes(label = Value),
            vjust = -0.5,
            size = 3,
            position = position_dodge(width = 0.8),
            color = "black") +
  scale_y_continuous(trans = "identity") +
  labs(
    title = "Count Plot of Transaction Types by Fraud Status",
    x = "Transaction Type",
    y = "Number of Transactions",
    fill = "Fraud Status"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "right"
  ) +
  scale_fill_manual(
    values = c("FALSE" = "skyblue", "TRUE" = "#FF6666"),
    labels = c("Non-Fraudulent", "Fraudulent")
  )

Insight:

  • Most transactions are non-fraudulent.
  • Fraud is more common in the TRANSFER and CASH_OUT transaction types, with TRANSFER showing a relatively higher fraud ratio.
  • The PAYMENT type dominates in volume but has no fraudulent cases.

3.2 Numerical Feature Distribution

Histograms for numerical columns

cols_of_interest <- c("amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")

data_filtered <- data[cols_of_interest]

for (col_name in cols_of_interest) {

  log_col_name <- paste0("Log_", col_name)
  data_filtered[[log_col_name]] <- log(data_filtered[[col_name]] + 1)

  p <- ggplot(data_filtered, aes(x = .data[[log_col_name]])) +
    geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black", alpha = 0.7) +
    labs(title = paste("Histogram of Log-transformed", col_name),
         x = paste("Log(", col_name, " + 1)", sep = ""),
         y = "Frequency") +
    theme_minimal()

  print(p)
}

Insights:

Log-Transformed Transaction Amount:

  • The distribution of the log-transformed transaction amounts is approximately normal.
  • The majority of transaction amounts are clustered around the log value of 10, suggesting that most transactions fall within a similar monetary range.

Log-Transformed Old Balance Origin (oldbalanceOrg):

  • The distribution shows a sharp peak at the log value of 0, indicating a significant number of transactions where the old balance was zero.
  • Apart from this, there is a secondary distribution centered around the log value of 10, reflecting a range of balances typical for most users.

Log-Transformed New Balance Origin (newbalanceOrig):

  • The distribution is right-skewed, with the majority of the log-transformed newbalanceOrig values clustered towards the lower end of the scale.
  • A prominent peak around the log value of 0 indicates that many of the original newbalanceOrig values are close to 1.
  • This skewness suggests the original data contained a few extremely high values, which may need careful handling in statistical analysis.

3.3 Correlation Analysis

#correlation matrix using hierarchical clustering

num_cols <- sapply(data, is.numeric)
cor_matrix <- cor(data[, num_cols], use = "complete.obs")
## Warning in cor(data[, num_cols], use = "complete.obs"): the standard deviation
## is zero
cor_matrix <- cor_matrix[rownames(cor_matrix) != "type_CASH_IN", colnames(cor_matrix) != "type_CASH_IN"]

dist_matrix <- as.dist(1 - cor_matrix)

if (any(is.na(dist_matrix))) {
    stop("Distance matrix contains NA values.")
}
if (any(dist_matrix < 0, na.rm = TRUE)) {
    stop("Distance matrix contains negative values.")
}

hclust_order <- hclust(dist_matrix)$order

cor_matrix <- cor_matrix[hclust_order, hclust_order]


corrplot(
    cor_matrix,
    method = "circle",
    type = "full",
    tl.cex = 0.8,
    cl.cex = 0.8,
    tl.col = "black",
    addrect = 2
)

Insights

Strong correlations:

  • newbalanceDest and oldbalanceDest
  • newbalanceOrig and oldbalanceOrg

3.4 Balance Analysis

head(data)
##   step   amount nameOrig oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## 1    1  9839.64      498        170136      160296.36      563              0
## 2    1  1864.28      727         21249       19384.72      613              0
## 3    1   181.00      293           181           0.00      124              0
## 4    1   181.00      977           181           0.00       92          21182
## 5    1 11668.14      202         41554       29885.86       10              0
## 6    1  7817.71      259         53860       46042.29      721              0
##   newbalanceDest isFraud type_CASH_IN type_CASH_OUT type_DEBIT type_PAYMENT
## 1              0   FALSE            0             0          0            1
## 2              0   FALSE            0             0          0            1
## 3              0    TRUE            0             0          0            0
## 4              0    TRUE            0             1          0            0
## 5              0   FALSE            0             0          0            1
## 6              0   FALSE            0             0          0            1
##   type_TRANSFER day period ratio_orig change_orig  ratio_dest change_dest
## 1             0   1  Night 0.05783397     9839.64 0.000000000           0
## 2             0   1  Night 0.08773495     1864.28 0.000000000           0
## 3             1   1  Night 1.00000000      181.00 0.000000000           0
## 4             0   1  Night 1.00000000      181.00 0.008544991       21182
## 5             0   1  Night 0.28079463    11668.14 0.000000000           0
## 6             0   1  Night 0.14514872     7817.71 0.000000000           0
##   change_diff
## 1     9839.64
## 2     1864.28
## 3      181.00
## 4   -21001.00
## 5    11668.14
## 6     7817.71

Scatter Plots

Investigate consistency of balances. First scatter plot help to understand the relationship between the sender’s original and new balances after transactions, also helps detect unusual or unexpected patterns, like abrupt zero balances or discrepancies. Second scatter plot aims to study how transactions impact the recipient’s account balance. And helps spot anomalies where balances remain unchanged (potential red flags for fraud or simulation.

# Scatter plot for oldbalanceOrg vs newbalanceOrig
plotorg <- ggplot(data, aes(x = oldbalanceOrg, y = newbalanceOrig)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Scatter Plot: oldbalanceOrg vs newbalanceOrig",
       x = "Old Balance Origin",
       y = "New Balance Origin") +
  theme_minimal()

print(plotorg)

# Scatter plot for oldbalanceDest vs newbalanceDest
plotdest <- ggplot(data, aes(x = oldbalanceDest, y = newbalanceDest)) +
  geom_point(alpha = 0.5, color = "lightgreen") +
  labs(title = "Scatter Plot: oldbalanceDest vs newbalanceDest",
       x = "Initial balance of recipient before the transaction",
       y = "New balance of recipient after the transaction") +
  theme_minimal()

# Print the second plot
print(plotdest)

3.5 Fraud Pattern Detection

To investigate fraud patterns by comparing amounts and balance changes for fraudulent and non-fraudulent transactions.

# Violin plot for Transaction Amount by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = amount, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  scale_y_log10() + # Use log scale for better visualization if amounts vary widely
  labs(title = "Transaction Amount Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Transaction Amount (log scale)") +
  theme_minimal() +
  theme(legend.position = "none")

The distribution of transaction amounts for both fraudulent and non-fraudulent transactions. The shape of the violin indicates the density (frequency) of transactions at various amount levels. Besides, log scale is used to better visualize the wide range of transaction amounts. If the fraudulent distribution is skewed towards higher amounts compared to non-fraudulent transactions, it suggests fraud often involves larger sums of money. And peaks (wider sections) indicate common transaction amounts. Other side, narrow tails in the non-fraudulent category might indicate a consistent range of typical transaction amounts, while broader tails for fraud might highlight its unpredictability.

# Violin plot for Balance Changes (oldbalanceOrg - newbalanceOrig) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = oldbalanceOrg - newbalanceOrig, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Origin Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (oldbalanceOrg - newbalanceOrig)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with most balance changes concentrated near small values (close to 0). This indicates that for non-fraudulent transactions, the sender’s balance changes are typically modest and predictable. The balance change pattern is consistent with legitimate transactions, where the deducted amount aligns with normal activities.

  • Fraudulent Transactions (TRUE)

    The distribution is much broader, and balance changes can be extremely large, reaching values in the millions. This suggests that fraudulent transactions often involve significant deductions from the sender’s account. The sharp contrast in balance change magnitude highlights that fraud frequently involves outliers (large, irregular transactions).

# Violin plot for Balance Changes (newbalanceDest - oldbalanceDest) by Fraud Status
ggplot(data, aes(x = as.factor(isFraud), y = newbalanceDest - oldbalanceDest, fill = as.factor(isFraud))) +
  geom_violin(trim = FALSE) +
  labs(title = "Destination Balance Change Distribution by Fraud Status",
       x = "Fraud Status (0 = Non-Fraud, 1 = Fraud)",
       y = "Balance Change (newbalanceDest - oldbalanceDest)") +
  theme_minimal() +
  theme(legend.position = "none")

  • Non-Fraudulent Transactions (FALSE)

    The distribution is very narrow, with balance changes clustered near zero. This indicates that in most legitimate transactions, recipients’ balances either increase modestly or stay consistent with expected transaction behavior. The low variability suggests predictable and consistent crediting in legitimate transactions.

  • Fraudulent Transactions (TRUE)

    The distribution for fraudulent transactions also exhibits a narrow range, but there are significant outliers. These outliers indicate: Large negative balance changes, where credits do not align with the deducted amounts from the sender. Inconsistent or unusual behavior, possibly due to incomplete or simulated transactions.

4.0 Modeling

4.1 Classification Model: random forest

# Split data sets
data_rf <- data
data_rf$isFraud <- as.factor(data_rf$isFraud)
levels(data_rf$isFraud) <- c("NonFraud", "Fraud") 
set.seed(123)
train_index <- createDataPartition(data_rf$isFraud, p = 0.7, list = FALSE)
train_data <- data_rf[train_index, ]
test_data <- data_rf[-train_index, ]
# Train the random forest model
set.seed(123)
rf_model <- randomForest(isFraud ~ ., data = train_data, ntree = 100, importance = TRUE)


rf_predictions <- predict(rf_model, newdata = test_data, type = "class")



rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$isFraud)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NonFraud  Fraud
##   NonFraud   114659      3
##   Fraud           1    166
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.9999, 1)        
##     No Information Rate : 0.9985             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9881             
##                                              
##  Mcnemar's Test P-Value : 0.6171             
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.9822             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9940             
##              Prevalence : 0.9985             
##          Detection Rate : 0.9985             
##    Detection Prevalence : 0.9985             
##       Balanced Accuracy : 0.9911             
##                                              
##        'Positive' Class : NonFraud           
## 
# Convert training and test sets to matrix format


X_train <- model.matrix(isFraud ~ . - 1, data = train_data)
X_test <- model.matrix(isFraud ~ . - 1, data = test_data)


class(X_train)  # should be "matrix"
## [1] "matrix" "array"
y_train <- as.numeric(train_data$isFraud) - 1  # Converting factors to 0 and 1
y_test <- as.numeric(test_data$isFraud) - 1
# Setting parameters
params <- list(
  objective = "binary:logistic", 
  eval_metric = "error",         
  max_depth = 6,                
  eta = 0.3                      
)

# Training the XGBoost model
set.seed(123)
xgb_model <- xgboost(
  data = X_train,
  label = y_train,
  params = params,
  nrounds = 100,
  verbose = 0
)


xgb_predictions <- predict(xgb_model, newdata = X_test)
xgb_pred_labels <- ifelse(xgb_predictions > 0.5, "Fraud", "NonFraud")

# Evaluate XGBoost models
xgb_conf_matrix <- confusionMatrix(factor(xgb_pred_labels, levels = c("NonFraud", "Fraud")), test_data$isFraud)
print(xgb_conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NonFraud  Fraud
##   NonFraud   114660      4
##   Fraud           0    165
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.9999, 1)        
##     No Information Rate : 0.9985             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.988              
##                                              
##  Mcnemar's Test P-Value : 0.1336             
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.9763             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 1.0000             
##              Prevalence : 0.9985             
##          Detection Rate : 0.9985             
##    Detection Prevalence : 0.9986             
##       Balanced Accuracy : 0.9882             
##                                              
##        'Positive' Class : NonFraud           
## 
rf_accuracy <- rf_conf_matrix$overall["Accuracy"]
rf_sensitivity <- rf_conf_matrix$byClass["Sensitivity"]
rf_specificity <- rf_conf_matrix$byClass["Specificity"]


xgb_accuracy <- xgb_conf_matrix$overall["Accuracy"]
xgb_sensitivity <- xgb_conf_matrix$byClass["Sensitivity"]
xgb_specificity <- xgb_conf_matrix$byClass["Specificity"]

# Creating a Performance Comparison Table
performance_comparison <- data.frame(
  Model = c("Random Forest", "XGBoost"),
  Accuracy = c(rf_accuracy, xgb_accuracy),
  Sensitivity = c(rf_sensitivity, xgb_sensitivity),
  Specificity = c(rf_specificity, xgb_specificity)
)

print(performance_comparison)
##           Model  Accuracy Sensitivity Specificity
## 1 Random Forest 0.9999652   0.9999913   0.9822485
## 2       XGBoost 0.9999652   1.0000000   0.9763314
performance_long <- melt(performance_comparison, id.vars = "Model", variable.name = "Metric", value.name = "Value")

# Plotting Performance Comparison Bar Graphs
ggplot(performance_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  labs(title = "Model Performance Comparison", x = "Metric", y = "Value") +
  theme_minimal()

# Feature Importance
rf_importance <- importance(rf_model)
rf_importance <- data.frame(Feature = rownames(rf_importance), Importance = rf_importance[, 1])
rf_importance <- rf_importance[order(-rf_importance$Importance), ]


library(ggplot2)
ggplot(rf_importance, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Feature Importance from Random Forest", x = "Features", y = "Importance") +
  theme_minimal()

#Building model 


n_features <- ncol(data_rf) - 1
# Setting up cross-validation controls for random searches
train_control_random <- trainControl(
  method = "cv",        
  number = 5,           
  classProbs = TRUE,    
  summaryFunction = twoClassSummary 
)
# Random search
set.seed(123)
rf_random_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneLength = 10, 
  trControl = train_control_random,
  ntree = 100     
)

print(rf_random_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9999890  0.9999925  0.9417722
##    4    0.9999973  0.9999888  0.9544304
##    6    0.9999956  0.9999888  0.9518987
##    8    0.9999950  0.9999850  0.9518987
##   10    0.9999939  0.9999888  0.9518987
##   13    0.9999717  0.9999888  0.9518987
##   15    0.9949039  0.9999888  0.9518987
##   17    0.9949044  0.9999888  0.9518987
##   19    0.9873092  0.9999888  0.9518987
##   22    0.9885763  0.9999888  0.9518987
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
best_mtry <- rf_random_search$bestTune$mtry
cat("Best mtry from Random Search:", best_mtry, "\n")
## Best mtry from Random Search: 4
# Grid Search
# Narrow down mtry based on random search results
rf_grid <- expand.grid(
  mtry = seq(max(1, best_mtry - 2), best_mtry + 2, by = 1) #Make sure mtry is not less than 1
)

# Setting up cross-validation controls for grid searches
train_control_grid <- trainControl(
  method = "cv",        # cross-validation
  number = 5,           # 5-fold 
  classProbs = TRUE,    
  summaryFunction = twoClassSummary # Use of ROC as an assessment indicator
)

# Grid Search
set.seed(123)
rf_grid_search <- train(
  isFraud ~ .,
  data = train_data,
  method = "rf",
  metric = "ROC",
  tuneGrid = rf_grid,
  trControl = train_control_grid,
  ntree = 100
)


print(rf_grid_search)
## Random Forest 
## 
## 267937 samples
##     20 predictor
##      2 classes: 'NonFraud', 'Fraud' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 214350, 214350, 214350, 214349, 214349 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   2     0.9999898  0.9999925  0.9518987
##   3     0.9999943  0.9999888  0.9518987
##   4     0.9999965  0.9999888  0.9518987
##   5     0.9999958  0.9999888  0.9544304
##   6     0.9999974  0.9999850  0.9569620
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
cat("Best mtry from Grid Search:", rf_grid_search$bestTune$mtry, "\n")
## Best mtry from Grid Search: 6
# Model Evaluation
# Evaluate the final model on the test set
final_model <- rf_grid_search$finalModel
test_predictions <- predict(rf_grid_search, newdata = test_data, type = "prob")[, 2]

# Calculate AUC
test_roc <- roc(test_data$isFraud, test_predictions)
## Setting levels: control = NonFraud, case = Fraud
## Setting direction: controls < cases
cat("Test AUC:", auc(test_roc), "\n")
## Test AUC: 0.9970317
# Plotting ROC curves
plot(test_roc, main = "ROC Curve for Final Model")

As can be seen from the provided ROC curve graphs, the model’s classification performance is very strong. The curve is close to the upper left corner, indicating that the model performs well in distinguishing between positive (fraudulent) and negative (non-fraudulent) samples with high Sensitivity and Specificity. In addition, the area under the ROC curve (AUC) is expected to be close to 1, which further validates that the model’s classification ability is almost optimal under different thresholds. However, this near-perfect performance may require vigilance against the possibility of overfitting, especially if the test data is too similar to the training data distribution or there is a data leakage problem. A combination of confusion matrices, classification metrics (e.g., precision and recall), and further examination of the data distribution is needed to ensure the robustness and generalisation ability of the model.

# Confusion Matrix
test_predictions_class <- predict(rf_grid_search, newdata = test_data, type = "raw")
confusion_matrix <- confusionMatrix(test_predictions_class, test_data$isFraud)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NonFraud  Fraud
##   NonFraud   114659      2
##   Fraud           1    167
##                                              
##                Accuracy : 1                  
##                  95% CI : (0.9999, 1)        
##     No Information Rate : 0.9985             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9911             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.9882             
##          Pos Pred Value : 1.0000             
##          Neg Pred Value : 0.9940             
##              Prevalence : 0.9985             
##          Detection Rate : 0.9985             
##    Detection Prevalence : 0.9985             
##       Balanced Accuracy : 0.9941             
##                                              
##        'Positive' Class : NonFraud           
## 

From the confusion matrix and statistical metrics, the Random Forest model performed extremely well in the classification task. The confusion matrix shows that the model had very few misclassifications in the test set, with 2 predictions in the NonFraud category being misclassified as Fraud and only 1 prediction in the Fraud category being misclassified as NonFraud. the overall Accuracy of the model is 1, with an almost perfect 95% confidence interval (0.9999, 1). The Sensitivity of 1 indicates that the model correctly detects all positive samples (NonFraud), while the Specificity of 0.9882 indicates that the model is also very accurate in classifying negative samples (Fraud). The Positive Pred Value (Pos Pred Value) of 1 indicates that all samples predicted as NonFraud are true NonFraud, while the Negative Pred Value (Neg Pred Value) of 0.994 indicates that most of the samples predicted as Fraud are also accurate.

4.2 Regression Model

# Subset the data for fraudulent transactions
fraud_data <- subset(data, isFraud == 1)

# Remove rows with missing values in the 'amount' column
fraud_data <- fraud_data[!is.na(fraud_data$amount), ]

# Print the distribution of transaction types
cat("Distribution of transaction types:\n")
## Distribution of transaction types:
print(colSums(fraud_data[, grep("^type_", colnames(fraud_data))]))
##  type_CASH_IN type_CASH_OUT    type_DEBIT  type_PAYMENT type_TRANSFER 
##             0            21             0             0           543
# Remove transaction types with zero samples
fraud_data <- fraud_data[, !colnames(fraud_data) %in% c("type_CASH_IN", "type_DEBIT", "type_PAYMENT")]

# Define features and target variable
features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", 
              "type_CASH_OUT", "type_TRANSFER")
y <- fraud_data$amount

# Standardize numerical features
fraud_data[, c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")] <- 
  scale(fraud_data[, c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest")])

# Split the data into training and testing sets
set.seed(42)
train_index <- sample(1:nrow(fraud_data), 0.7 * nrow(fraud_data))
train_data <- fraud_data[train_index, ]
test_data <- fraud_data[-train_index, ]

# Prepare training and testing matrices for glmnet
X_train <- as.matrix(train_data[, features])
y_train <- train_data$amount
X_test <- as.matrix(test_data[, features])
y_test <- test_data$amount

# Perform Lasso regression with cross-validation
lasso_model <- cv.glmnet(X_train, y_train, alpha = 1)
plot(lasso_model)

# Retrieve the best lambda value
best_lambda <- lasso_model$lambda.min
cat("Optimal lambda value:", best_lambda, "\n")
## Optimal lambda value: 28366.47
# Train the final model with the best lambda
final_model <- glmnet(X_train, y_train, alpha = 1, lambda = best_lambda)

# Predict on the testing set
lasso_predictions <- predict(final_model, s = best_lambda, newx = X_test)

# Evaluate the model
lasso_rmse <- sqrt(mean((lasso_predictions - y_test)^2))
lasso_mae <- mean(abs(lasso_predictions - y_test))
lasso_r2 <- 1 - sum((y_test - lasso_predictions)^2) / sum((y_test - mean(y_test))^2)
cat("Model evaluation:\n")
## Model evaluation:
cat("RMSE:", lasso_rmse, "\n")
## RMSE: 58577.56
cat("MAE:", lasso_mae, "\n")
## MAE: 36915.42
cat("R-squared:", lasso_r2, "\n")
## R-squared: 0.998981
# Create a list of linear regression features
linear_features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest",
                     "type_CASH_OUT", "type_TRANSFER")

# linear regression modeling
linear_model <- lm(amount ~ ., data = train_data[, c(linear_features, "amount")])

# Predicting on the test set
linear_predictions <- predict(linear_model, newdata = test_data[, linear_features])

# Model Evaluation
linear_rmse <- sqrt(mean((linear_predictions - test_data$amount)^2))
linear_r2 <- 1 - sum((test_data$amount - linear_predictions)^2) / sum((test_data$amount - mean(test_data$amount))^2)

cat("Linear Regression Evaluation:\n")
## Linear Regression Evaluation:
cat("RMSE:", linear_rmse, "\n")
## RMSE: 3807.065
cat("R-squared:", linear_r2, "\n")
## R-squared: 0.9999957
# Adding secondary features to training and test sets
train_data$oldbalanceOrg_sq <- train_data$oldbalanceOrg^2
test_data$oldbalanceOrg_sq <- test_data$oldbalanceOrg^2

train_data$newbalanceOrig_sq <- train_data$newbalanceOrig^2
test_data$newbalanceOrig_sq <- test_data$newbalanceOrig^2

# Create a list of polynomial features
poly_features <- c("oldbalanceOrg", "newbalanceOrig", "oldbalanceOrg_sq", "newbalanceOrig_sq",
                   "oldbalanceDest", "newbalanceDest", "type_CASH_OUT", "type_TRANSFER")

# polynomial regression modeling
poly_model <- lm(amount ~ ., data = train_data[, c(poly_features, "amount")])


poly_predictions <- predict(poly_model, newdata = test_data[, poly_features])

# Model Evaluation
poly_rmse <- sqrt(mean((poly_predictions - test_data$amount)^2))
poly_r2 <- 1 - sum((test_data$amount - poly_predictions)^2) / sum((test_data$amount - mean(test_data$amount))^2)

cat("Polynomial Regression Evaluation:\n")
## Polynomial Regression Evaluation:
cat("RMSE:", poly_rmse, "\n")
## RMSE: 3903.402
cat("R-squared:", poly_r2, "\n")
## R-squared: 0.9999955
# Summarize results into tables
results <- data.frame(
  Model = c("Lasso Regression", "Polynomial Regression", "Linear Regression"),
  RMSE = c(lasso_rmse, poly_rmse, linear_rmse),
  R_squared = c(lasso_r2, poly_r2, linear_r2)
)

kable(results, caption = "Comparison of Regression Models")
Comparison of Regression Models
Model RMSE R_squared
Lasso Regression 58577.562 0.9989810
Polynomial Regression 3903.402 0.9999955
Linear Regression 3807.065 0.9999957
# Creating a comparison data frame
comparison <- data.frame(
  Actual = test_data$amount, 
  Polynomial_Predicted = as.vector(poly_predictions), 
  Linear_Predicted = as.vector(linear_predictions),
  Lasso_Predicted = as.vector(lasso_predictions)
)

# actual vs lasso regression predictions
ggplot(comparison, aes(x = Actual, y = Lasso_Predicted)) +
  geom_point(alpha = 0.6, color = "yellow") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Lasso_ Regression: Actual vs Predicted",
       x = "Actual Values", y = "Predicted Values") +
  theme_minimal()

# actual vs. predicted by multinomial regression

ggplot(comparison, aes(x = Actual, y = Polynomial_Predicted)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Polynomial Regression: Actual vs Predicted",
       x = "Actual Values", y = "Predicted Values") +
  theme_minimal()

# Actual vs. predicted by linear regression

ggplot(comparison, aes(x = Actual, y = Linear_Predicted)) +
  geom_point(alpha = 0.6, color = "green") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Linear Regression: Actual vs Predicted",
       x = "Actual Values", y = "Predicted Values") +
  theme_minimal()

# Feature Importance Analysis
coef_values <- as.vector(coef(final_model, s = best_lambda)[-1])  # Remove intercept
feature_importance <- data.frame(Feature = features, Importance = abs(coef_values))
feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Plot feature importance
ggplot(feature_importance, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Feature Importance from Lasso Regression Model", x = "Features", y = "Importance") +
  theme_minimal()

# Extract coefficients for polynomial regression models

poly_coefficients <- coef(poly_model)

# Creating a Characteristic Importance Data Frame

feature_importance_poly <- data.frame(
  Feature = names(poly_coefficients)[-1],  
  Importance = abs(poly_coefficients[-1]) 
)

# In order of importance
feature_importance_poly <- feature_importance_poly[order(-feature_importance_poly$Importance), ]

# Characteristic importance of visual polynomial regression
ggplot(feature_importance_poly, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "green", alpha = 0.7) +
  coord_flip() +
  labs(title = "Feature Importance from Polynomial Regression",
       x = "Features", y = "Importance") +
  theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

#  Extract the coefficients of the linear regression model
linear_coefficients <- coef(linear_model)

# Creating a Characteristic Importance Data Frame
feature_importance_linear <- data.frame(
  Feature = names(linear_coefficients)[-1],  
  Importance = abs(linear_coefficients[-1])  
)


feature_importance_linear <- feature_importance_linear[order(-feature_importance_linear$Importance), ]

## Visualize the importance of the characteristics of linear regression
ggplot(feature_importance_linear, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "orange", alpha = 0.7) +
  coord_flip() +
  labs(title = "Feature Importance from Linear Regression",
       x = "Features", y = "Importance") +
  theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

# Importance of merger features
combined_feature_importance <- merge(
  feature_importance,
  feature_importance_poly,
  by = "Feature",
  all = TRUE,
  suffixes = c("_Lasso", "_Poly")
)

combined_feature_importance <- merge(
  combined_feature_importance,
  feature_importance_linear,
  by = "Feature",
  all = TRUE,
  suffixes = c("", "_Linear")
)


colnames(combined_feature_importance) <- c("Feature", "Importance_Lasso", "Importance_Poly", "Importance_Linear")


combined_feature_importance[is.na(combined_feature_importance)] <- 0


combined_feature_importance <- combined_feature_importance[order(-combined_feature_importance$Importance_Lasso), ]

print(combined_feature_importance)
##             Feature Importance_Lasso Importance_Poly Importance_Linear
## 5     oldbalanceOrg          2349177   2424834.34021   2424294.6363237
## 2    newbalanceOrig           634935    704887.83231    710621.3190200
## 1    newbalanceDest                0        13.94108         0.6389738
## 3 newbalanceOrig_sq                0       355.36431         0.0000000
## 4    oldbalanceDest                0        15.94334         0.7334692
## 6  oldbalanceOrg_sq                0       430.86093         0.0000000
## 7     type_CASH_OUT                0       122.87337        12.4795722
## 8     type_TRANSFER                0         0.00000         0.0000000

5.0 Result Discussion

5.1 Classification Model:

Evaluation

The Random Forest and XGBoost models were evaluated in an extensive manner using a number of key performance metrics with the aim of ensuring the robustness and generalizability of the models in detecting fraudulent transactions. Both models were optimized using hyperparameter tuning, cross-validation, and then evaluated using the following metrics:

ROC-AUC: This metric assesses the models’ ability to discriminate between fraudulent and non-fraudulent transactions. A higher ROC-AUC score indicates better model performance.

Recall (Sensitivity): Sensitivity ensures that fraudulent transactions are correctly identified, minimizing false negatives.

Specificity: Specificity measures the model’s ability to correctly identify non-fraudulent transactions, reducing false positives.

Accuracy: Accuracy is the overall correctness of the model in classifying both fraudulent and non-fraudulent transactions.

Confusion Matrix: This provides a detailed breakdown of correct and incorrect predictions, offering insight into false positives and false negatives.

Kappa Score: The Kappa score measures the agreement between the model’s predictions and the actual outcomes, accounting for chance.

P-Value: The p-value tests the statistical significance of the model’s performance. Both models were optimized using Random Search and Grid Search for hyperparameter tuning, resulting in high-performance levels. A 5-fold cross-validation approach was used to ensure the models’ robustness and generalizability.

Results

Random Forest Model:

Accuracy: 1.0 (95% CI: 0.9999–1)

Sensitivity (Recall): 1.0000, ensuring no fraudulent transactions were missed.

Specificity: 0.9822, indicating a high rate of correctly identifying legitimate transactions.

Confusion Matrix: 114,659 non-fraudulent and 166 fraudulent transactions correctly classified, with minimal misclassifications (2 false negatives and 1 false positive).

Kappa: 0.9881, signifying a strong agreement between predictions and actual outcomes.

P-Value: <0.0000000000000002, showing statistically significant performance.

McNemar’s Test P-Value: 0.6171, indicating no significant difference in error rates between fraud and non-fraud classifications.

XGBoost Model:

Accuracy: 1.0 (95% CI: 0.9999–1)

Sensitivity: 1.0000, ensuring that all fraudulent transactions were identified.

Specificity: 0.9763, demonstrating strong precision in classifying non-fraudulent transactions.

Confusion Matrix: 114,660 non-fraudulent and 165 fraudulent transactions correctly classified, with 4 false negatives and no false positives.

Kappa: 0.988, showing excellent agreement between predictions and actual outcomes. P-Value: <0.0000000000000002, indicating statistically significant performance.

McNemar’s Test P-Value: 0.1336, indicating no significant difference in error rates between the two classes.

Model Comparison:

Both models had almost identical accuracy scores, 0.9999652, and perfect sensitivity, 1.0000, meaning that no fraudulent transactions were missed.

The Random Forest model had a slightly higher specificity of 0.9822, while the XGBoost model had a slightly lower specificity of 0.9763 but perfectly classified fraudulent transactions with no false positives. The performance of both models was rather outstanding, though XGBoost was a little bit better with sensitivity and Random Forest was a little bit better regarding specificity.

5.2 Regression Model:

Evaluation of Models

Lasso Regression:

Optimal Lambda: 28,366.47 Model Evaluation: RMSE: 58,577.56 MAE: 36,915.42 R-squared: 0.998981

Lasso Regression performed very well, with a high R-squared value, hence a good fit to the data. However, there was slight underestimation of the predicted values, which is evident from the scatter plot, where most of the points lie below the diagonal reference line. However, this model provides a very good insight into the important factors that drive the amount in fraudulent transactions.

Linear Regression:

Model Evaluation: RMSE: 3,807.065 R-squared: 0.9999957

Linear Regression performed very well. With the R-squared value so close to its maximum, it indeed fitted well for both forecasted and actual values. For this value of RMSE, the real results have minor dispersion in their estimations; because of this fact, most scatter points fall close to an ideal reference line. With minimal underestimation, this was one of the powerful forecast models for the target variable.

Polynomial Regression:

Model Evaluation: RMSE: 3,903.402 R-squared: 0.9999955

Polynomial Regression also had a really great performance; its R-squared value is close to Linear Regression. Looking from the scatter plot, there is slight underestimation that may indicate that indeed, Polynomial Regression was as effective, but with marginally more error from Linear Regression. However, it captured the complicated relationship between the factors and provided the necessary insights in the data.

Feature Importance Analysis

The feature importance across the different models, namely, Lasso Regression, Polynomial Regression, and Linear Regression, are giving the important features that mostly influence the predictions of fraudulent transaction amount. Key takeaways for the above Feature Importance plots for different models are as follows:

Lasso Regression: oldbalanceOrg and newbalanceOrig are the features with the maximum importance values; newbalanceDest, newbalanceOrig_sq, and type_CASH_OUT are features for which the feature importance values turn out to be zero, showing their minimal contribution toward the model in question.

Polynomial Regression: This model underlined two most influential features, oldbalanceOrg and newbalanceOrig, although several other features were given lower yet important values, such as newbalanceDest and newbalanceOrig_sq, while type_CASH_OUT also had a moderate value of importance - not reflected from the LASSO model.

Linear Regression: As in Polynomial Regression, Linear Regression granted a high feature importance to oldbalanceOrg and newbalanceOrig, supplementing with smaller contributions with features such as newbalanceDest and type_CASH_OUT.

#6.0 Conclusion

Objective 1:

The key objective of this analysis was to predict the amount of a fraudulent transaction based on several features of the transaction, and both Random Forest and XGBoost models were highly successful in this. Both models showed great results with nearly identical accuracy and sensitivity scores, hence fraud transactions are surely detected.

The Random Forest model performed slightly better in terms of specificity, where it could correctly identify the non-fraudulent transactions, while XGBoost showed perfect sensitivity, thus detecting all fraudulent transactions with perfect precision. Both models performed remarkably well in predicting the fraudulent transaction amounts and hence minimized false positives and false negatives, making a reliable fraud detection system.

In all, the goal of estimating fraudulent transaction amounts was accomplished, and both models are strong tools in detecting fraud with high accuracy. This consolidates the fact that machine learning models, especially those like Random Forest and XGBoost, do actually fit well for fraud detection tasks and may be deployed in the real world for that purpose.

Objective 2:

The use of models such as Lasso Regression, Linear Regression, and Polynomial Regression successfully pinned the most influential features that were driving the amount of fraudulent transactions. Its consistency across different models places oldbalanceOrg and newbalanceOrig among the important features in order to predict how much fraudulent amount of transaction hence very important on fraud detection. First, these two features had the biggest importance values on all the models viewed, which indicated their major and significant role in influencing transaction amounts.

While other features like newbalanceDest, newbalanceOrig_sq, and type_CASH_OUT were important to various extents in the models, they did not bear as much influence as oldbalanceOrg and newbalanceOrig. Specifically, Lasso Regression assigned zero importance to many features, while Linear and Polynomial Regression models captured a broader set of relevant features.

These critical factors were identified by understanding the importance of each feature within the prediction models, along with performance evaluation metrics. The R-squared values being close to 1 in both Linear Regression and Polynomial Regression revealed a very strong fit between the predicted and actual values, confirming that the models have effectively captured the underlying patterns in fraudulent transaction amounts. Second, it seems that all models slightly underestimate making a prediction; although the models are very precise, further adjustments of the models will probably help improve such predictions.

Therefore, the analysis achieved its objective in identifying major factors driving fraudulent transaction amounts and pointed out oldbalanceOrg and newbalanceOrig as most important. The models provided rich insights into what drives fraud and can be useful in further steps of fraud detection and prevention.