Mobile Money Fraud Detection: A Predictive Modelling & Segmentation Study

Author

GRACE KALU

Published

May 10, 2026

1. Executive Summary

Mobile money fraud represents one of the most pressing operational risks facing financial services providers across sub-Saharan Africa. This study analyses 199,999 mobile money transactions from a simulated but operationally realistic payment system (PaySim), containing transaction type, amount, origin and destination account balances, and fraud labels. The objective is to build a comprehensive, reproducible fraud detection pipeline that supports both automated real-time screening and strategic risk management.

Five analytical techniques were applied: (1) classification models (Logistic Regression and Random Forest) to predict fraudulent transactions; (2) model explainability via variable importance and a contribution plot to identify the key fraud drivers; (3) K-Means clustering to segment transactions into behavioural risk profiles; (4) PCA dimensionality reduction to visualise cluster separation; and (5) time series analysis of hourly fraud counts to detect temporal patterns and produce a 3-step forecast.

Key findings show that fraud occurs exclusively in CASH_OUT and TRANSFER transactions and is most strongly associated with origin-account balance behaviour, in particular the post-transaction origin balance (newbalanceOrig). The Random Forest classifier achieves near-perfect discrimination on the held-out test set (AUC ≈ 1.0), whereas Logistic Regression detects fewer than half of the test-set fraud cases at a 0.5 threshold. Clustering isolates a small high-risk segment from several larger segments with little or no fraud. The hourly fraud series shows irregular, weakly autocorrelated spikes (AR(1) coefficient ≈ 0.12). The integrated recommendation is a two-stage real-time screening system combining transaction-type gating with balance-anomaly classification.


2. Professional Disclosure

Job Title: Legal Counsel
Organisation: Anonymous
Sector: Financial Services / Fintech / Banking

Classification Model: As Legal Counsel, I am responsible for due diligence and compliance, particularly in financial transactions. Predicting whether a transaction is fraudulent or legitimate is the core operational need of any fraud team: a binary classifier provides the real-time scoring engine that flags suspicious transactions before they are processed, directly reducing financial losses.

Model Explainability: Fraud decisions must be justifiable to compliance officers, regulators, and customers who dispute declined transactions. Variable importance analysis allows me to communicate to non-technical stakeholders which transaction characteristics triggered a fraud flag, which is essential for regulatory compliance and customer relations.

Clustering: Understanding natural groupings of transactions enables differentiated risk controls. High-risk clusters receive mandatory secondary authentication while low-risk clusters get straight-through processing, balancing security with customer experience.

Dimensionality Reduction (PCA): With many correlated financial features, PCA allows us to visualise the transaction landscape, validate that risk clusters are genuinely distinct, and identify outliers representing novel fraud patterns.

Time Series Analysis: Fraud volumes fluctuate over time. Forecasting hourly fraud counts enables the operations team to staff review queues proactively, allocate fraud loss provisions, and detect emerging fraud waves early.


3. Data Collection & Sampling

Source: PaySim, a synthetic mobile money transaction simulator developed by Lopez-Rojas et al. (2016) and calibrated against real transaction logs from a mobile money operator. It is used here as a benchmark dataset that mirrors the transaction patterns of the analyst’s operational environment.

Collection Method: Obtained from a publicly available research repository. The dataset stands in for the analyst’s internal transaction data, which cannot be published due to confidentiality obligations.

Sampling Frame: All transaction records in the PaySim simulation covering approximately 31 days of mobile money activity across five transaction types: CASH_IN, CASH_OUT, DEBIT, PAYMENT, and TRANSFER.

Time Period: Hourly time steps numbered 1 to 741, representing approximately 31 days of transaction activity (741 / 24 ≈ 30.9). In this sample, 524 of those steps contain at least one transaction.

Ethical Notes: This dataset contains no personally identifiable information. All account identifiers are synthetic. No consent was required.

Variables:

Variable         Type         Description
step             Integer      Hour of transaction (1–741)
type             Categorical  Transaction type
amount           Continuous   Transaction amount
oldbalanceOrg    Continuous   Origin balance before transaction
newbalanceOrig   Continuous   Origin balance after transaction
oldbalanceDest   Continuous   Destination balance before transaction
newbalanceDest   Continuous   Destination balance after transaction
isFraud          Binary       1 = Fraudulent, 0 = Legitimate

4. Data Description & EDA

Code
# Install packages if not already installed
packages <- c("tidyverse", "caret", "randomForest", "pROC",
               "cluster", "factoextra", "ggcorrplot", "forecast",
               "scales", "gridExtra", "knitr", "kableExtra")

installed <- rownames(installed.packages())
for (pkg in packages) {
  if (!pkg %in% installed) install.packages(pkg, quiet = TRUE)
}

library(tidyverse)
library(caret)
library(randomForest)
library(pROC)
library(cluster)
library(factoextra)
library(ggcorrplot)
library(forecast)
library(scales)
library(gridExtra)
library(knitr)
library(kableExtra)

# Colour palette
BLUE   <- "#1f77b4"
ORANGE <- "#ff7f0e"
GREEN  <- "#2ca02c"
RED    <- "#d62728"
GREY   <- "#7f7f7f"

set.seed(42)

# ── Load data ────────────────────────────────────────────────
df <- read_csv("transactions.csv", show_col_types = FALSE)

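# ── Feature engineering ──────────────────────────────────────
df <- df %>%
  mutate(
    balance_diff_orig = newbalanceOrig  - oldbalanceOrg,
    balance_diff_dest = newbalanceDest  - oldbalanceDest,
    # 1 if the account was already empty BEFORE the transaction
    zero_orig         = as.integer(oldbalanceOrg  == 0),
    zero_dest         = as.integer(oldbalanceDest == 0),
    # +1 in the denominator avoids division by zero for empty accounts
    amount_to_balance = amount / (oldbalanceOrg + 1),
    # 1 if the origin balance fell by exactly the transaction amount
    # (a balance-consistency check; it does not require the balance to reach zero)
    exact_drain       = as.integer(round(oldbalanceOrg - amount, 2) ==
                                   round(newbalanceOrig, 2)),
    log_amount        = log1p(amount),
    # Integer codes follow factor level order (alphabetical: CASH_IN = 1, ..., TRANSFER = 5)
    type_enc          = as.integer(as.factor(type)),
    isFraud           = as.factor(isFraud)
  )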

cat("Dataset loaded:", nrow(df), "rows x", ncol(df), "columns\n")
Dataset loaded: 199999 rows x 18 columns
Code
cat("Fraud cases:   ", sum(df$isFraud == 1), "\n")
Fraud cases:    282 
Code
cat("Fraud rate:    ", scales::percent(mean(df$isFraud == 1), accuracy = 0.01), "\n")
Fraud rate:     0.14% 
Code
FEATURES <- c("type_enc", "amount", "oldbalanceOrg", "newbalanceOrig",
              "oldbalanceDest", "newbalanceDest", "balance_diff_orig",
              "balance_diff_dest", "zero_orig", "zero_dest",
              "amount_to_balance", "exact_drain")

4.1 Dataset Overview

Code
# Descriptive statistics
df %>%
  select(amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest) %>%
  summary()
     amount         oldbalanceOrg      newbalanceOrig     oldbalanceDest     
 Min.   :       0   Min.   :       0   Min.   :       0   Min.   :        0  
 1st Qu.:   13387   1st Qu.:       0   1st Qu.:       0   1st Qu.:        0  
 Median :   74267   Median :   14201   Median :       0   Median :   132057  
 Mean   :  180242   Mean   :  831436   Mean   :  852333   Mean   :  1093644  
 3rd Qu.:  208638   3rd Qu.:  107849   3rd Qu.:  144963   3rd Qu.:   941029  
 Max.   :52042803   Max.   :50399045   Max.   :40399045   Max.   :235932694  
 newbalanceDest     
 Min.   :        0  
 1st Qu.:        0  
 Median :   213810  
 Mean   :  1218886  
 3rd Qu.:  1109082  
 Max.   :311404901  
Code
# Fraud by transaction type
df %>%
  group_by(type) %>%
  summarise(
    Transactions = n(),
    Fraud_Cases  = sum(isFraud == 1),
    Fraud_Rate   = scales::percent(mean(isFraud == 1), accuracy = 0.01)
  ) %>%
  arrange(desc(Fraud_Cases)) %>%
  kable(caption = "Fraud by Transaction Type") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Fraud by Transaction Type
type       Transactions   Fraud_Cases   Fraud_Rate
TRANSFER          16630           150        0.90%
CASH_OUT          70571           132        0.19%
CASH_IN           43919             0        0.00%
DEBIT              1317             0        0.00%
PAYMENT           67562             0        0.00%

4.2 Visual EDA

Code
# 1. Fraud cases by type
p1 <- df %>%
  group_by(type) %>%
  summarise(Fraud_Cases = sum(isFraud == 1)) %>%
  arrange(desc(Fraud_Cases)) %>%
  ggplot(aes(x = reorder(type, -Fraud_Cases), y = Fraud_Cases,
             fill = Fraud_Cases > 0)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = Fraud_Cases), vjust = -0.3, fontface = "bold") +
  scale_fill_manual(values = c(BLUE, RED)) +
  labs(title = "Fraud Cases by Transaction Type",
       x = "Type", y = "Number of Fraud Cases") +
  theme_minimal(base_size = 12)

# 2. Log amount by fraud status
p2 <- df %>%
  mutate(Status = ifelse(isFraud == 1, "Fraud", "Legitimate")) %>%
  ggplot(aes(x = log_amount, fill = Status)) +
  geom_histogram(bins = 50, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c(RED, BLUE)) +
  labs(title = "Log(Amount) by Fraud Status",
       x = "log(1 + Amount)", y = "Count") +
  theme_minimal(base_size = 12)

# 3. Fraud rate by type
p3 <- df %>%
  group_by(type) %>%
  summarise(Fraud_Rate = mean(isFraud == 1) * 100) %>%
  arrange(desc(Fraud_Rate)) %>%
  ggplot(aes(x = reorder(type, -Fraud_Rate), y = Fraud_Rate)) +
  geom_col(fill = ORANGE) +
  geom_text(aes(label = sprintf("%.2f%%", Fraud_Rate)),
            vjust = -0.3, size = 3.5) +
  labs(title = "Fraud Rate (%) by Transaction Type",
       x = "Type", y = "Fraud Rate (%)") +
  theme_minimal(base_size = 12)

# 4. Zero-balance pattern
p4 <- df %>%
  group_by(Status = ifelse(isFraud == 1, "Fraud", "Legitimate")) %>%
  summarise(Pct_Zero = mean(zero_orig == 1) * 100) %>%
  ggplot(aes(x = Status, y = Pct_Zero, fill = Status)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Pct_Zero)),
            vjust = -0.3, fontface = "bold", size = 4) +
  scale_fill_manual(values = c(RED, BLUE)) +
  labs(title = "% Transactions with Zero Origin Balance",
       x = "", y = "Percentage (%)") +
  theme_minimal(base_size = 12)

grid.arrange(p1, p2, p3, p4, ncol = 2)

Fraud patterns by transaction type and balance behaviour

4.3 Correlation Heatmap

Code
corr_cols <- c("amount", "oldbalanceOrg", "newbalanceOrig",
               "oldbalanceDest", "newbalanceDest", "balance_diff_orig",
               "balance_diff_dest", "zero_orig", "zero_dest",
               "exact_drain")

corr_mat <- df %>%
  mutate(isFraud_num = as.integer(isFraud) - 1) %>%
  select(all_of(corr_cols), isFraud_num) %>%
  cor()

ggcorrplot(corr_mat, hc.order = FALSE, type = "lower",
           lab = TRUE, lab_size = 2.5,
           colors = c(RED, "white", BLUE),
           title = "Feature Correlation Matrix",
           ggtheme = theme_minimal())

Feature correlation matrix

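Key EDA Findings: Fraud occurs exclusively in CASH_OUT and TRANSFER transactions. The strongest candidate fraud signals are the origin-balance flags: exact_drain, which marks transactions where the origin balance falls by exactly the transaction amount, and zero_orig, which marks origin accounts that were already empty before the transaction. Transaction amount alone does not reliably distinguish fraud from legitimate activity.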
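
Because the flag names can be read more strongly than their definitions, a small sketch may help. It applies the same exact_drain and zero_orig formulas used in the feature-engineering step to three made-up transactions, alongside a stricter drained_to_zero flag. The toy values and the drained_to_zero column are assumptions for illustration and are not part of the original pipeline.

```r
# Toy transactions (hypothetical values, not taken from the dataset)
toy <- data.frame(
  oldbalanceOrg  = c(1000, 1000,   0),
  amount         = c(1000,  250, 500),
  newbalanceOrig = c(   0,  750,   0)
)

# Same definition as the feature-engineering step: the origin balance
# falls by exactly the transaction amount
toy$exact_drain <- as.integer(round(toy$oldbalanceOrg - toy$amount, 2) ==
                              round(toy$newbalanceOrig, 2))

# Same definition as the feature-engineering step: account empty beforehand
toy$zero_orig <- as.integer(toy$oldbalanceOrg == 0)

# Hypothetical stricter flag: a funded account is emptied completely
toy$drained_to_zero <- as.integer(toy$oldbalanceOrg > 0 &
                                  toy$newbalanceOrig == 0)

print(toy)
```

Only the first toy row is a full drain. The second row also sets exact_drain to 1 although 750 remains in the account, which is why the flag is best read as a balance-consistency indicator.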


5. Classification Model

5.1 Business Justification

A binary classifier is the operational core of real-time fraud screening. We compare Logistic Regression (interpretable, regulatory-friendly) against Random Forest (captures non-linear balance interactions). The winning model feeds into the transaction screening pipeline, scoring every CASH_OUT and TRANSFER before processing.

5.2 Model Training

Code
# Prepare modelling data
model_df <- df %>%
  select(all_of(FEATURES), isFraud) %>%
  mutate(across(all_of(FEATURES), as.numeric))

# Stratified train/test split (80/20)
train_idx <- createDataPartition(model_df$isFraud, p = 0.8, list = FALSE)
train_df  <- model_df[ train_idx, ]
test_df   <- model_df[-train_idx, ]

cat("Training:", nrow(train_df), "rows | Fraud:", sum(train_df$isFraud == 1), "\n")
Training: 160000 rows | Fraud: 226 
Code
cat("Test:    ", nrow(test_df),  "rows | Fraud:", sum(test_df$isFraud  == 1), "\n")
Test:     39999 rows | Fraud: 56 
Code
# ── Logistic Regression ──────────────────────────────────────
lr_model <- glm(isFraud ~ ., data = train_df,
                family = binomial(link = "logit"))

lr_prob <- predict(lr_model, newdata = test_df, type = "response")
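# Fix the levels explicitly so the factor matches isFraud even if one class is never predicted
lr_pred <- factor(ifelse(lr_prob > 0.5, 1, 0), levels = c(0, 1))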

cat("\n=== Logistic Regression ===\n")

=== Logistic Regression ===
Code
print(confusionMatrix(lr_pred, test_df$isFraud, positive = "1"))
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39935    30
         1     8    26
                                          
               Accuracy : 0.999           
                 95% CI : (0.9987, 0.9993)
    No Information Rate : 0.9986          
    P-Value [Acc > NIR] : 0.0069949       
                                          
                  Kappa : 0.5773          
                                          
 Mcnemar's Test P-Value : 0.0006577       
                                          
            Sensitivity : 0.46429         
            Specificity : 0.99980         
         Pos Pred Value : 0.76471         
         Neg Pred Value : 0.99925         
             Prevalence : 0.00140         
         Detection Rate : 0.00065         
   Detection Prevalence : 0.00085         
      Balanced Accuracy : 0.73204         
                                          
       'Positive' Class : 1               
                                          
Code
# ── Random Forest ─────────────────────────────────────────────
# Use class weights to handle imbalance
fraud_weight <- sum(train_df$isFraud == 0) / sum(train_df$isFraud == 1)
class_wts    <- c("0" = 1, "1" = fraud_weight)

# Reproducibility relies on set.seed(42) above;
# randomForest() has no seed argument of its own
rf_model <- randomForest(
  isFraud ~ ., data = train_df,
  ntree      = 200,
  maxnodes   = 50,
  classwt    = class_wts,
  importance = TRUE
)

rf_prob <- predict(rf_model, newdata = test_df, type = "prob")[, "1"]
rf_pred <- predict(rf_model, newdata = test_df)

cat("\n=== Random Forest ===\n")

=== Random Forest ===
Code
print(confusionMatrix(rf_pred, test_df$isFraud, positive = "1"))
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 39943     0
         1     0    56
                                     
               Accuracy : 1          
                 95% CI : (0.9999, 1)
    No Information Rate : 0.9986     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.0014     
         Detection Rate : 0.0014     
   Detection Prevalence : 0.0014     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 1          
                                     

5.3 ROC Curves & Confusion Matrices

Code
# ROC curves
roc_lr <- roc(as.integer(test_df$isFraud) - 1, lr_prob, quiet = TRUE)
roc_rf <- roc(as.integer(test_df$isFraud) - 1, rf_prob, quiet = TRUE)

auc_lr <- auc(roc_lr)
auc_rf <- auc(roc_rf)

# Plot ROC
par(mfrow = c(1, 1))
plot(roc_lr, col = BLUE, lwd = 2.5,
     main = "ROC Curves — Fraud Classification")
plot(roc_rf, col = ORANGE, lwd = 2.5, add = TRUE)
abline(a = 0, b = 1, lty = 2, col = GREY)
legend("bottomright",
       legend = c(sprintf("Logistic Regression (AUC = %.4f)", auc_lr),
                  sprintf("Random Forest (AUC = %.4f)", auc_rf)),
       col = c(BLUE, ORANGE), lwd = 2.5)

ROC curves for both models
Code
cat("Logistic Regression AUC:", round(auc_lr, 4), "\n")
Logistic Regression AUC: 0.732 
Code
cat("Random Forest AUC:      ", round(auc_rf, 4), "\n")
Random Forest AUC:       1 
Code
# Confusion matrix plots
cm_lr <- as.data.frame(table(Predicted = lr_pred, Actual = test_df$isFraud))
cm_rf <- as.data.frame(table(Predicted = rf_pred, Actual = test_df$isFraud))

plot_cm <- function(cm_df, title, fill_col) {
  ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
    geom_tile(color = "white") +
    geom_text(aes(label = Freq), size = 5, fontface = "bold") +
    scale_fill_gradient(low = "white", high = fill_col) +
    labs(title = title, x = "Actual", y = "Predicted") +
    theme_minimal(base_size = 12) +
    theme(legend.position = "none")
}

p_cm_lr <- plot_cm(cm_lr, sprintf("Logistic Regression\nAUC = %.4f", auc_lr), BLUE)
p_cm_rf <- plot_cm(cm_rf, sprintf("Random Forest\nAUC = %.4f",       auc_rf), ORANGE)

grid.arrange(p_cm_lr, p_cm_rf, ncol = 2)

Confusion matrices for both models
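
The headline metrics can be checked by hand from the four cells of the Logistic Regression confusion matrix reported above. The sketch below copies those counts and recomputes the main statistics; only the variable names are new.

```r
# Counts copied from the Logistic Regression confusion matrix above
tp <- 26      # fraud predicted as fraud
fn <- 30      # fraud predicted as legitimate
fp <- 8       # legitimate predicted as fraud
tn <- 39935   # legitimate predicted as legitimate

sensitivity <- tp / (tp + fn)                    # recall on fraud cases
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)                    # precision
balanced    <- (sensitivity + specificity) / 2
# Accuracy of always predicting "legitimate"
nir         <- (tn + fp) / (tp + fn + fp + tn)

lr_metrics <- round(c(sensitivity         = sensitivity,
                      specificity         = specificity,
                      ppv                 = ppv,
                      balanced_accuracy   = balanced,
                      no_information_rate = nir), 5)
print(lr_metrics)
```

The no-information rate of about 0.9986 is why the 0.999 accuracy says little here: with 0.14% fraud, a model that never flags anything is almost as accurate, so sensitivity and precision are the figures to watch.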

5.4 Deployment Recommendation

Random Forest is recommended for deployment. On the held-out test set it identifies all 56 fraud cases with no false positives (AUC ≈ 1.0), whereas Logistic Regression detects 26 of the 56. In a mobile money context, missing fraud causes direct financial loss, so high recall is the primary objective. A decision threshold of 0.30 is proposed for production to favour fraud recall while keeping manual review volumes manageable; because only the default cut-off is evaluated above, the threshold should be tuned on validation data before it is fixed.
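
To make the threshold trade-off concrete, the following sketch compares recall, precision, and the number of flagged transactions at cut-offs of 0.30 and 0.50. The scores and labels are invented for illustration, and threshold_metrics is a hypothetical helper rather than part of the pipeline above.

```r
# Hypothetical scores and labels, for illustration only
prob  <- c(0.05, 0.10, 0.35, 0.45, 0.80, 0.20, 0.60, 0.02)
truth <- c(0,    0,    1,    1,    1,    0,    0,    0)

# Recall, precision and review volume at a given probability cut-off
threshold_metrics <- function(prob, truth, cutoff) {
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(cutoff    = cutoff,
    recall    = tp / (tp + fn),
    precision = tp / (tp + fp),
    flagged   = tp + fp)
}

res <- t(sapply(c(0.30, 0.50), threshold_metrics,
                prob = prob, truth = truth))
print(res)
```

In this toy example the lower cut-off catches every fraud case at the cost of more flagged transactions. The same helper could be applied to rf_prob and the test labels to see whether 0.30 behaves as intended on the real scores.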


6. Model Explainability

6.1 Business Justification

In a regulated financial environment, fraud models cannot be black boxes. Compliance officers, auditors, and customers who dispute decisions require clear explanations. Variable importance identifies which transaction characteristics drive predictions, supporting model governance and regulatory compliance.

6.2 Variable Importance

Code
# RF variable importance
rf_imp <- importance(rf_model, type = 1) %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  rename(Importance = MeanDecreaseAccuracy) %>%
  arrange(Importance)

p_rf_imp <- rf_imp %>%
  ggplot(aes(x = reorder(Feature, Importance), y = Importance,
             fill = Importance > median(Importance))) +
  geom_col(show.legend = FALSE) +
  # Zero reference line on the importance axis (drawn vertically after coord_flip)
  geom_hline(yintercept = 0, color = GREY) +
  scale_fill_manual(values = c(BLUE, RED)) +
  coord_flip() +
  labs(title = "Random Forest — Variable Importance",
       x = "", y = "Mean Decrease in Accuracy") +
  theme_minimal(base_size = 11)

# LR coefficients
lr_coef <- broom::tidy(lr_model) %>%
  filter(term != "(Intercept)") %>%
  mutate(abs_estimate = abs(estimate)) %>%
  arrange(abs_estimate)

p_lr_coef <- lr_coef %>%
  ggplot(aes(x = reorder(term, abs_estimate), y = abs_estimate,
             fill = abs_estimate > median(abs_estimate))) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(BLUE, RED)) +
  coord_flip() +
  labs(title = "Logistic Regression — |Coefficient|",
       x = "", y = "|Coefficient|") +
  theme_minimal(base_size = 11)

grid.arrange(p_rf_imp, p_lr_coef, ncol = 2)

Variable importance for Random Forest and Logistic Regression coefficients
Code
cat("Top 5 features (Random Forest):\n")
Top 5 features (Random Forest):
Code
print(tail(rf_imp, 5)[5:1, ])
             Feature Importance
12    newbalanceOrig  19.791278
11       exact_drain  13.057136
10 amount_to_balance   7.320789
9             amount   6.957357
8  balance_diff_dest   3.064424
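
Mean decrease in accuracy is a permutation-style measure: shuffle one feature, re-score, and see how much accuracy is lost. The sketch below illustrates that idea on made-up data with a plain glm and in-sample accuracy. It is a simplification under stated assumptions, since randomForest computes its version per tree on out-of-bag observations; all names and data here are hypothetical.

```r
set.seed(1)

# Hypothetical data: y depends strongly on x1 and not at all on x2
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.integer(x1 + rnorm(n, sd = 0.5) > 0)
d  <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# Share of correct predictions at a 0.5 cut-off
accuracy <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(as.integer(p > 0.5) == data$y)
}

base_acc <- accuracy(fit, d)

# Importance of a feature = drop in accuracy after shuffling that feature
perm_importance <- sapply(c("x1", "x2"), function(f) {
  shuffled      <- d
  shuffled[[f]] <- sample(shuffled[[f]])
  base_acc - accuracy(fit, shuffled)
})

print(round(perm_importance, 3))
```

Shuffling x1 should cost a large share of accuracy while shuffling x2 should cost almost nothing, which is the same reading applied to the ranking above.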

6.3 Waterfall Contribution Plot

Code
# Pick one true-positive fraud case from test set
test_with_pred <- test_df %>%
  mutate(prob = rf_prob, pred = rf_pred)

tp_case <- test_with_pred %>%
  filter(isFraud == 1, pred == 1) %>%
  slice(1) %>%
  select(all_of(FEATURES))

# Baseline: predicted fraud probability for an "average" test transaction
mean_vals <- test_df %>% select(all_of(FEATURES)) %>%
  summarise(across(everything(), mean))

baseline_prob <- predict(rf_model, newdata = mean_vals, type = "prob")[, "1"]
case_prob     <- predict(rf_model, newdata = tp_case,   type = "prob")[, "1"]

# Approximate contribution per feature: swap one feature at a time from the
# average transaction to the fraud case and record the change in probability
contribs <- sapply(FEATURES, function(feat) {
  row_mod         <- mean_vals
  row_mod[[feat]] <- tp_case[[feat]]
  pred_prob       <- predict(rf_model, newdata = row_mod, type = "prob")[, "1"]
  pred_prob - baseline_prob
})

contrib_df <- data.frame(
  Feature      = names(contribs),
  Contribution = as.numeric(contribs)
) %>% arrange(Contribution)

contrib_df %>%
  ggplot(aes(x = reorder(Feature, Contribution), y = Contribution,
             fill = Contribution > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.8) +
  scale_fill_manual(values = c(GREEN, RED)) +
  coord_flip() +
  labs(title = sprintf("Waterfall Explanation — Fraud Prob: %.2f%%  (Baseline: %.2f%%)",
                       case_prob * 100,
                       mean(predict(rf_model, newdata = mean_vals,
                                    type = "prob")[, "1"]) * 100),
       x = "", y = "Contribution to Fraud Probability") +
  theme_minimal(base_size = 12)

Feature contribution waterfall for one representative fraud transaction

6.4 Plain-Language Interpretation

The five most important fraud signals, ranked by the Random Forest's mean decrease in accuracy reported above:

  1. newbalanceOrig (19.8): The origin balance after the transaction is the strongest single signal.
  2. exact_drain (13.1): Flags transactions in which the origin balance falls by exactly the transaction amount.
  3. amount_to_balance (7.3): The transaction amount relative to the pre-transaction origin balance.
  4. amount (7.0): The raw transaction amount, which adds information once combined with the balance features.
  5. balance_diff_dest (3.1): The change in the destination account balance.

Transaction type (type_enc) does not appear in the top five, yet it remains structurally decisive: no fraud was observed in PAYMENT, CASH_IN, or DEBIT transactions in this dataset.

7. Transaction Segmentation (Clustering)

7.1 Business Justification

Not all transactions carry the same risk. K-Means clustering groups transactions by natural behavioural patterns — independent of the fraud label — enabling differentiated controls: high-risk clusters receive mandatory secondary authentication, low-risk clusters get straight-through processing.

7.2 Optimal K — Elbow & Silhouette

Code
CLUSTER_FEATS <- c("amount", "oldbalanceOrg", "newbalanceOrig",
                   "balance_diff_orig", "zero_orig", "type_enc", "exact_drain")

# Sample 10,000 rows for speed
sample_idx <- sample(nrow(df), 10000)
df_sample  <- df[sample_idx, ] %>%
  select(all_of(CLUSTER_FEATS), isFraud) %>%
  mutate(across(all_of(CLUSTER_FEATS), as.numeric))

# Scale
X_clust <- df_sample %>%
  select(all_of(CLUSTER_FEATS)) %>%
  scale()

# Use a smaller subset consistently for silhouette calculation
sil_idx  <- 1:2000
X_sil    <- X_clust[sil_idx, ]
dist_sil <- dist(X_sil)

# Elbow & silhouette
K_range     <- 2:8
inertias    <- numeric(length(K_range))
silhouettes <- numeric(length(K_range))

for (i in seq_along(K_range)) {
  k  <- K_range[i]
  km <- kmeans(X_clust, centers = k, nstart = 10, iter.max = 100)
  inertias[i]    <- km$tot.withinss
  silhouettes[i] <- mean(silhouette(km$cluster[sil_idx], dist_sil)[, 3])
}

best_k <- K_range[which.max(silhouettes)]

elbow_df <- data.frame(k = K_range, Inertia = inertias, Silhouette = silhouettes)

p_elbow <- ggplot(elbow_df, aes(x = k, y = Inertia)) +
  geom_line(color = BLUE, linewidth = 1.2) +
  geom_point(color = BLUE, size = 3) +
  geom_vline(xintercept = best_k, color = RED,
             linetype = "dashed", linewidth = 1) +
  labs(title = "Elbow Method", x = "Number of Clusters (k)",
       y = "Inertia (Within-cluster SSE)") +
  theme_minimal(base_size = 12)

p_sil <- ggplot(elbow_df, aes(x = k, y = Silhouette)) +
  geom_line(color = GREEN, linewidth = 1.2) +
  geom_point(color = GREEN, size = 3) +
  geom_vline(xintercept = best_k, color = RED,
             linetype = "dashed", linewidth = 1) +
  labs(title = "Silhouette Analysis",
       x = "Number of Clusters (k)", y = "Silhouette Score") +
  theme_minimal(base_size = 12)

grid.arrange(p_elbow, p_sil, ncol = 2)

Elbow and silhouette plots for optimal cluster selection
Code
cat("Selected k =", best_k, "| Silhouette =", round(max(silhouettes), 4), "\n")
Selected k = 8 | Silhouette = 0.6311 
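
Note that the selected k = 8 sits at the upper edge of the tested range 2:8, so the silhouette maximum may lie beyond it. As a reminder of what the score measures, the sketch below computes the average silhouette width from first principles on made-up, well-separated data. The avg_silhouette helper and the toy data are assumptions for illustration; the analysis above uses cluster::silhouette instead.

```r
set.seed(1)

# Two well-separated toy groups in two dimensions (hypothetical data)
X <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Average silhouette width: s(i) = (b - a) / max(a, b), where
#   a = mean distance from point i to the other points in its own cluster
#   b = smallest mean distance from point i to the points of another cluster
avg_silhouette <- function(X, cluster) {
  D <- as.matrix(dist(X))
  s <- sapply(seq_len(nrow(X)), function(i) {
    own    <- cluster == cluster[i]
    own[i] <- FALSE                      # exclude the point itself
    if (!any(own)) return(0)             # singleton clusters score 0
    a      <- mean(D[i, own])
    others <- setdiff(unique(cluster), cluster[i])
    b      <- min(sapply(others, function(k) mean(D[i, cluster == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

scores <- sapply(2:4, function(k) {
  km <- kmeans(X, centers = k, nstart = 10)
  avg_silhouette(X, km$cluster)
})
names(scores) <- 2:4

best_k_toy <- as.integer(names(scores)[which.max(scores)])
print(round(scores, 3))
```

On data with two obvious groups the score peaks at k = 2 and falls as clusters are split further. A score that is still rising at the largest k tested is a signal to widen the range before settling on a value.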

7.3 Cluster Profiles & Business Naming

Code
km_final <- kmeans(X_clust, centers = best_k, nstart = 10, iter.max = 100)
df_sample$Cluster <- km_final$cluster

# Profile table
profile <- df_sample %>%
  group_by(Cluster) %>%
  summarise(
    Count          = n(),
    Fraud_Rate     = mean(isFraud == 1),
    Avg_Amount     = mean(amount),
    Avg_OldBal     = mean(oldbalanceOrg),
    Pct_ZeroOrig   = mean(zero_orig),
    Pct_ExactDrain = mean(exact_drain)
  ) %>%
  arrange(Fraud_Rate)

# Assign business names
n_clusters   <- nrow(profile)
segment_names <- c("Low-Risk Transactions",
                   rep("Medium-Risk Transactions", max(0, n_clusters - 2)),
                   "High-Risk / Suspicious Transactions")
profile$Segment <- segment_names

df_sample <- df_sample %>%
  left_join(profile %>% select(Cluster, Segment), by = "Cluster")

seg_colors <- c("Low-Risk Transactions"               = GREEN,
                "Medium-Risk Transactions"             = ORANGE,
                "High-Risk / Suspicious Transactions"  = RED)

profile %>%
  select(Segment, Count, Fraud_Rate, Avg_Amount,
         Pct_ZeroOrig, Pct_ExactDrain) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  kable(caption = "Cluster Profile Table") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Cluster Profile Table
Segment                               Count   Fraud_Rate    Avg_Amount   Pct_ZeroOrig   Pct_ExactDrain
Low-Risk Transactions                  1652       0.0000     195733.50         1.0000           0.0000
Medium-Risk Transactions                152       0.0000     165100.67         0.0000           0.0000
Medium-Risk Transactions               1622       0.0000     163031.77         1.0000           0.0000
Medium-Risk Transactions                566       0.0000     155172.08         0.0000           0.0000
Medium-Risk Transactions               2595       0.0000     233335.53         0.0000           0.0000
Medium-Risk Transactions               1481       0.0000     170201.30         0.0014           0.0000
Medium-Risk Transactions               1921       0.0068      31550.87         0.0000           1.0000
High-Risk / Suspicious Transactions      11       0.1818   13736059.04         0.6364           0.1818

Cluster fraud rate and size distribution

Code
fraud_seg <- df_sample %>%
  group_by(Segment) %>%
  summarise(Fraud_Rate = mean(isFraud == 1) * 100) %>%
  arrange(Fraud_Rate)

p_bar <- ggplot(fraud_seg,
                aes(x = reorder(Segment, Fraud_Rate),
                    y = Fraud_Rate, fill = Segment)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.2f%%", Fraud_Rate)),
            vjust = -0.3, fontface = "bold") +
  scale_fill_manual(values = seg_colors) +
  labs(title = "Fraud Rate by Segment (%)",
       x = "", y = "Fraud Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 10, hjust = 1))

size_seg <- df_sample %>% count(Segment)

p_pie <- ggplot(size_seg, aes(x = "", y = n, fill = Segment)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = seg_colors) +
  geom_text(aes(label = sprintf("%.1f%%", n / sum(n) * 100)),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold") +
  labs(title = "Cluster Size Distribution") +
  theme_void(base_size = 12)

grid.arrange(p_bar, p_pie, ncol = 2)

Fraud rate and size by segment

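Segment Descriptions:

  • Low-Risk Transactions: No fraud observed in the sampled cluster. Recommended for straight-through processing. Because six clusters tie at a 0.00% fraud rate, which of them receives this label is arbitrary.
  • Medium-Risk Transactions: Six clusters. Five show no fraud in the sample; the sixth, in which every transaction has exact_drain = 1, shows a 0.68% fraud rate and is the natural candidate for classifier scoring.
  • High-Risk / Suspicious Transactions: A very small cluster (11 of the 10,000 sampled transactions) with very large amounts (average ≈ 13.7 million) and an 18.18% fraud rate. All transactions in this segment should trigger secondary authentication before processing.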

8. Dimensionality Reduction (PCA)

8.1 Business Justification

With several correlated balance features, PCA reduces the space to uncorrelated principal components for visualisation and outlier detection. A biplot confirms cluster separation and communicates the transaction risk landscape to non-technical stakeholders.

8.2 PCA Biplot & Scree Plot

Code
pca_result <- prcomp(X_clust, scale. = FALSE)
var_exp    <- summary(pca_result)$importance[2, 1:2]

pca_df <- as.data.frame(pca_result$x[, 1:2]) %>%
  mutate(Segment = df_sample$Segment)

# Biplot
p_biplot <- ggplot(pca_df, aes(x = PC1, y = PC2, color = Segment)) +
  geom_point(alpha = 0.3, size = 1) +
  scale_color_manual(values = seg_colors) +
  labs(
    title = "PCA Biplot — Transaction Segments",
    x = sprintf("PC1 (%.1f%% variance)", var_exp[1] * 100),
    y = sprintf("PC2 (%.1f%% variance)", var_exp[2] * 100)
  ) +
  theme_minimal(base_size = 12) +
  geom_hline(yintercept = 0, linetype = "dashed", color = GREY, linewidth = 0.4) +
  geom_vline(xintercept = 0, linetype = "dashed", color = GREY, linewidth = 0.4)

# Add loading arrows
loadings <- as.data.frame(pca_result$rotation[, 1:2]) %>%
  rownames_to_column("Feature")
scale_factor <- 3.5

for (i in seq_len(nrow(loadings))) {
  p_biplot <- p_biplot +
    annotate("segment",
             x = 0, y = 0,
             xend = loadings$PC1[i] * scale_factor,
             yend = loadings$PC2[i] * scale_factor,
             arrow = arrow(length = unit(0.2, "cm")),
             color = "black", linewidth = 0.8) +
    annotate("text",
             x = loadings$PC1[i] * scale_factor * 1.2,
             y = loadings$PC2[i] * scale_factor * 1.2,
             label = loadings$Feature[i],
             size = 2.8, fontface = "bold")
}

print(p_biplot)

PCA biplot with cluster labels and feature loading vectors
Code
scree_df <- data.frame(
  PC         = paste0("PC", 1:length(pca_result$sdev)),
  Variance   = (pca_result$sdev^2 / sum(pca_result$sdev^2)) * 100
) %>%
  mutate(Cumulative = cumsum(Variance),
         PC = factor(PC, levels = PC))

ggplot(scree_df, aes(x = PC)) +
  geom_col(aes(y = Variance), fill = BLUE, alpha = 0.7) +
  geom_line(aes(y = Cumulative, group = 1), color = RED,
            linewidth = 1.2) +
  geom_point(aes(y = Cumulative), color = RED, size = 2.5) +
  geom_hline(yintercept = 80, linetype = "dashed",
             color = GREY, linewidth = 0.8) +
  labs(title = "Scree Plot — Explained Variance",
       x = "Principal Component",
       y = "Variance Explained (%)") +
  theme_minimal(base_size = 12)

Scree plot — explained variance by component
Code
cat(sprintf("PC1: %.1f%% | PC2: %.1f%% | Combined: %.1f%%\n",
            var_exp[1]*100, var_exp[2]*100, sum(var_exp)*100))
PC1: 35.0% | PC2: 20.1% | Combined: 55.0%

Interpretation: PC1 separates high-value from low-value transactions (driven by balance and amount features). PC2 separates suspicious drain-pattern transactions from normal ones. Together the two components explain 55.0% of the variance, so the biplot gives a partial view: it is consistent with the cluster separation found above but does not validate it on its own.
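
The 80% reference line in the scree plot suggests a simple rule for how many components to keep. The sketch below shows how the variance shares and that cut-off can be computed from a prcomp object. The three toy features are invented, with two of them deliberately correlated to mimic the balance columns.

```r
set.seed(1)

# Hypothetical features: a and b are strongly correlated, c is independent
n <- 200
z <- rnorm(n)
X <- cbind(a = z + rnorm(n, sd = 0.2),
           b = z + rnorm(n, sd = 0.2),
           c = rnorm(n))

pca <- prcomp(scale(X))                      # standardise, then rotate

var_share <- pca$sdev^2 / sum(pca$sdev^2)    # variance share per component
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_share >= 0.80)[1]

print(round(rbind(var_share, cum_share), 3))
```

Applied to pca_result above, the same two lines would show how many components are needed to reach 80%, given that the first two cover 55.0%.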


9. Time Series Analysis

9.1 Business Justification

Fraud volumes fluctuate hour by hour. Forecasting fraud frequency enables the operations team to staff review queues proactively, set dynamic transaction limits, and detect emerging fraud waves before they peak.

9.2 Data Preparation & Decomposition

Code
ts_data <- df %>%
  mutate(isFraud_num = as.integer(isFraud) - 1) %>%
  group_by(step) %>%
  summarise(
    total       = n(),
    fraud_count = sum(isFraud_num),
    fraud_rate  = mean(isFraud_num)
  ) %>%
  arrange(step)

y_ts  <- ts_data$fraud_count
steps <- ts_data$step

cat("Time steps:", nrow(ts_data), "\n")
Time steps: 524 
Code
cat("Range: step", min(steps), "to", max(steps), "\n")
Range: step 1 to 741 
Code
cat("Mean hourly fraud count:", round(mean(y_ts), 2), "\n")
Mean hourly fraud count: 0.54 
Code
cat("Max  hourly fraud count:", max(y_ts), "\n")
Max  hourly fraud count: 3 
Code
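# Decomposition using decompose()
# Caveat: ts_data holds only steps with at least one transaction (524 rows
# spread over steps 1 to 741), so ts() treats non-consecutive hours as
# consecutive and the 24-hour seasonal period is only approximate.
ts_obj   <- ts(y_ts, frequency = 24)
decomp   <- decompose(ts_obj, type = "additive")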

autoplot(decomp) +
  labs(title = "Time Series Decomposition — Hourly Fraud Counts") +
  theme_minimal(base_size = 12)

Time series decomposition of hourly fraud counts
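
Since the series has 524 observed steps across a range of 741, one option is to rebuild it on a complete hourly grid before decomposition. The sketch below shows the idea in base R on a four-row toy table. Treating a missing hour as zero fraud is an assumption that would need checking against how the sample was drawn, and the object names are hypothetical.

```r
# Hypothetical hourly counts with hours 3 and 5 absent from the data
observed <- data.frame(step        = c(1, 2, 4, 6),
                       fraud_count = c(0, 1, 2, 1))

# Build the full hourly grid and left-join the observed counts onto it
full <- data.frame(step = seq(min(observed$step), max(observed$step)))
full <- merge(full, observed, by = "step", all.x = TRUE)

# Assumption: an hour with no recorded transactions had zero fraud cases
full$fraud_count[is.na(full$fraud_count)] <- 0

# Regular series: one observation per hour, 24 observations per day
ts_full <- ts(full$fraud_count, frequency = 24)
print(full)
```

With a regular grid, the frequency = 24 setting corresponds to a true daily cycle, and lags in the ACF and PACF plots correspond to actual hours.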

9.3 Stationarity & ACF/PACF

Code
# Heuristic stationarity check: split-half t-test on the mean
# (a quick screen, not a formal unit-root test such as ADF or KPSS)
h1 <- y_ts[1:floor(length(y_ts)/2)]
h2 <- y_ts[(floor(length(y_ts)/2)+1):length(y_ts)]
t_result <- t.test(h1, h2)

cat("=== Stationarity (Split-Half t-test) ===\n")
=== Stationarity (Split-Half t-test) ===
Code
cat(sprintf("First half  — Mean: %.3f, SD: %.3f\n", mean(h1), sd(h1)))
First half  — Mean: 0.492, SD: 0.682
Code
cat(sprintf("Second half — Mean: %.3f, SD: %.3f\n", mean(h2), sd(h2)))
Second half — Mean: 0.584, SD: 0.636
Code
cat(sprintf("p-value: %.4f\n", t_result$p.value))
p-value: 0.1126
Code
cat(ifelse(t_result$p.value > 0.05,
           "-> No significant mean shift between halves. Proceeding with d=0.\n",
           "-> Mean differs between halves. Differencing recommended (d=1).\n"))
-> No significant mean shift between halves. Proceeding with d=0.
Code
# ACF and PACF
par(mfrow = c(1, 2))
acf(y_ts,  lag.max = 36, main = "ACF — Hourly Fraud Count",
    col = BLUE, lwd = 2)
pacf(y_ts, lag.max = 36, main = "PACF — Hourly Fraud Count",
     col = ORANGE, lwd = 2)

ACF and PACF plots
Code
par(mfrow = c(1, 1))
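The split-half t-test compares means only, so it cannot on its own establish stationarity. The following is a minimal sketch of two complementary checks from base R's stats package: PP.test (Phillips-Perron unit-root test) and Box.test (Ljung-Box test for autocorrelation). The simulated Poisson series is a stand-in for y_ts, not the study data, and the lag of 24 is an illustrative choice.

```r
set.seed(42)
# Stand-in for y_ts: 524 simulated hourly counts with mean near 0.54
y_sim <- rpois(524, lambda = 0.54)

# Phillips-Perron test: the null hypothesis is a unit root (non-stationary),
# so a small p-value is evidence that differencing is not needed.
pp <- PP.test(y_sim)
cat(sprintf("Phillips-Perron p-value: %.4f\n", pp$p.value))

# Ljung-Box test: the null hypothesis is no autocorrelation up to lag 24
lb <- Box.test(y_sim, lag = 24, type = "Ljung-Box")
cat(sprintf("Ljung-Box p-value (lag 24): %.4f\n", lb$p.value))
```

Running the same two calls on y_ts would give a unit-root check that matches how the d term is usually chosen, plus a single summary test to read alongside the ACF plot.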

9.4 ARIMA Forecast — 3 Steps Ahead

Code
# Fit ARIMA automatically
arima_model <- auto.arima(ts(y_ts, frequency = 24),
                           stepwise = TRUE, approximation = TRUE)
cat("=== ARIMA Model ===\n")
=== ARIMA Model ===
Code
print(summary(arima_model))
Series: ts(y_ts, frequency = 24) 
ARIMA(1,0,0) with non-zero mean 

Coefficients:
         ar1    mean
      0.1205  0.5381
s.e.  0.0434  0.0325

sigma^2 = 0.4309:  log likelihood = -521.94
AIC=1049.89   AICc=1049.94   BIC=1062.67

Training set error measures:
                     ME      RMSE      MAE  MPE MAPE      MASE         ACF1
Training set 0.00013115 0.6551608 0.576782 -Inf  Inf 0.8900957 -0.002900605
Code
# Forecast 3 steps
fc <- forecast(arima_model, h = 3, level = 95)

cat("\n3-Step Fraud Forecast:\n")

3-Step Fraud Forecast:
Code
print(as.data.frame(fc))
         Point Forecast      Lo 95    Hi 95
22.83333      0.5937821 -0.6927670 1.880331
22.87500      0.5448496 -0.7509999 1.840699
22.91667      0.5389553 -0.7570287 1.834939
Code
# Plot — last 100 steps + forecast
y_plot   <- tail(y_ts, 100)
x_plot   <- tail(steps, 100)
x_future <- max(steps) + 1:3

plot_df <- data.frame(step = x_plot, fraud_count = y_plot, type = "Historical")

fc_df <- data.frame(
  step        = x_future,
  forecast    = as.numeric(fc$mean),
  lower       = as.numeric(fc$lower),
  upper       = as.numeric(fc$upper),
  type        = "Forecast"
)

ggplot() +
  geom_line(data = plot_df,
            aes(x = step, y = fraud_count),
            color = BLUE, linewidth = 1.2) +
  geom_line(data = fc_df,
            aes(x = step, y = forecast),
            color = RED, linewidth = 1.5, linetype = "dashed") +
  geom_point(data = fc_df,
             aes(x = step, y = forecast),
             color = RED, size = 3) +
  geom_ribbon(data = fc_df,
              aes(x = step, ymin = lower, ymax = upper),
              fill = RED, alpha = 0.2) +
  geom_vline(xintercept = max(steps), color = GREY,
             linetype = "dotted", linewidth = 1) +
  labs(title = "Hourly Fraud Count — ARIMA Forecast (3 Steps Ahead)",
       x = "Time Step (Hour)", y = "Fraud Count") +
  theme_minimal(base_size = 12)

3-step ahead ARIMA fraud forecast with prediction intervals
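For an AR(1) model with a non-zero mean, the h-step point forecast is the mean plus ar1^h times the last deviation from the mean, and the forecast variance is sigma^2 times the sum of ar1^(2j) for j from 0 to h-1. The following is a minimal sketch that approximately reproduces the forecast table above from the reported coefficients. It assumes the last observed hourly count was 1, which is inferred from the first point forecast rather than stated in the source.

```r
# Reported estimates from the ARIMA(1,0,0) summary above
ar1    <- 0.1205
mu     <- 0.5381
sigma2 <- 0.4309
y_last <- 1      # ASSUMPTION: last observed hourly fraud count

h     <- 1:3
point <- mu + ar1^h * (y_last - mu)

# Forecast variance: sigma^2 * (1 + ar1^2 + ... + ar1^(2 * (h - 1)))
fc_var <- sigma2 * cumsum(ar1^(2 * (h - 1)))
half   <- qnorm(0.975) * sqrt(fc_var)

manual_fc <- data.frame(h = h, point = point,
                        lo95 = point - half, hi95 = point + half)
print(round(manual_fc, 3))

# Counts cannot be negative, so one pragmatic option when reporting is to
# floor the lower bound at zero (a presentation choice, not a model fix).
manual_fc$lo95_floored <- pmax(manual_fc$lo95, 0)
```

Because ar1 is small, the point forecast is already within about 0.001 of the mean by the third step, which is why the forecast line in the plot is nearly flat. The negative lower bounds arise because the model treats the counts as Gaussian.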

10. Integrated Findings

The five techniques together form a coherent and actionable fraud detection strategy. Classification showed that fraud is almost perfectly predictable from balance-drain features, with Random Forest achieving near-perfect AUC. Explainability showed that exact_drain and newbalanceOrig dominate all other features, so even a simple business rule flagging accounts drained to zero would catch the majority of fraud. Clustering showed that transactions segment into distinct risk groups based on type and balance behaviour, which supports a tiered authentication strategy. PCA showed visible separation between those segments in the reduced feature space. Time series analysis found irregular fraud spikes with weak but statistically detectable lag-1 autocorrelation (AR(1) coefficient 0.12, s.e. 0.04). This supports short-horizon operational forecasting, although forecasts revert quickly to the long-run mean of about 0.54 frauds per hour.

Single Integrated Recommendation: Deploy a two-stage real-time fraud screening system. Stage 1 — gate by transaction type: route PAYMENT, CASH_IN, and DEBIT to straight-through processing. Stage 2 — apply the Random Forest classifier to every CASH_OUT and TRANSFER: transactions with predicted fraud probability above 0.30 are held for review or auto-declined. Fraud operations staffing should be adjusted dynamically using the ARIMA hourly forecast.
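The two-stage rule can be sketched as a small vectorised function. This is an illustration under stated assumptions, not the deployed logic: screen_transaction is a hypothetical name, the fraud probabilities are made-up inputs standing in for Random Forest predictions, and "HOLD" covers both outcomes the recommendation allows (manual review or auto-decline).

```r
# Hypothetical two-stage screen. fraud_prob would come from the fitted
# Random Forest; here it is supplied directly for illustration.
screen_transaction <- function(type, fraud_prob, threshold = 0.30) {
  gated <- type %in% c("CASH_OUT", "TRANSFER")            # Stage 1: type gate
  ifelse(!gated, "PASS",
         ifelse(fraud_prob > threshold, "HOLD", "PASS"))  # Stage 2: score
}

tx <- data.frame(
  type       = c("PAYMENT", "CASH_OUT", "TRANSFER", "CASH_IN", "TRANSFER"),
  fraud_prob = c(0.90, 0.10, 0.45, 0.05, 0.30)
)
tx$decision <- screen_transaction(tx$type, tx$fraud_prob)
print(tx)
```

The PAYMENT row passes despite its high score because Stage 1 routes that type to straight-through processing without consulting the classifier. The TRANSFER at exactly 0.30 also passes, since the rule holds only probabilities strictly above the threshold.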


11. Limitations & Further Work

  • Synthetic data: PaySim is calibrated against real transactions but is not actual transaction data. Production deployment requires retraining on live labelled logs.
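  • Class imbalance: The 0.14% fraud rate was handled with class weights. Resampling approaches such as SMOTE could be evaluated to see whether they improve the precision-recall trade-off.
  • Model complexity: XGBoost may match Random Forest performance with faster inference for real-time scoring at scale, and is worth benchmarking.
  • SHAP values: The contribution plot approximates Shapley values. Production systems should use a dedicated implementation such as the shapr R package for properly estimated Shapley values.
  • Time series: The auto.arima model is a baseline. The selected ARIMA(1,0,0) has no seasonal term, and its Gaussian 95% intervals extend below zero even though counts cannot be negative. A seasonal ARIMA, Prophet, or a count model (for example Poisson autoregression) may better capture intraday fraud patterns.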
  • Network features: Graph-based features (shared destination accounts, transaction velocity) would substantially improve recall on coordinated fraud rings.

References

  • Lopez-Rojas, E. A., Elmir, A., & Axelsson, S. (2016). PaySim: A financial mobile money simulator for fraud detection. 28th European Modeling and Simulation Symposium (EMSS 2016), Larnaca, Cyprus.
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.
  • Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.
  • Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5).

Appendix: AI Usage Statement

Claude (Anthropic) assisted with structuring the Quarto document and providing R code templates. All analytical decisions — feature selection, model hyperparameters, cluster interpretation, time series specification, and business recommendations — were reviewed, validated, and adapted independently by the author. The author accepts full responsibility for the content and conclusions of this submission.