Opayinka Mercy O
Real-time Fraud Detection and Cybersecurity in Digital Payments
With the rapid growth of digital payments (mobile money, online banking) in Nigeria, financial institutions are increasingly vulnerable to sophisticated fraud schemes and cyberattacks. Detecting these in real-time, often with limited historical fraud data, is crucial for maintaining trust and financial security. Traditional methods often fail to catch new, evolving fraud patterns.
Detect Fraud in real time.
Leveraging unsupervised learning algorithms (Isolation Forest) to identify unusual transaction patterns that deviate from normal behavior.
Real‑time fraud scoring via an interactive Shiny app
This project focuses on building an automated defense mechanism against payment fraud. Using the Isolation Forest algorithm, the system acts as a “security robot” that learns the complex patterns of legitimate transactions from the NIBSS dataset (1,000,000 records). Unlike traditional rules-based systems, this model detects anomalies based on deviation rather than fixed rules, allowing it to catch novel and zero-day fraud attacks. The solution is fully implemented as a deployable Shiny application that provides real-time scanning, risk scoring, and interactive analytics for fraud investigators.
# Load the NIBSS fraud dataset
df <- read.csv("nibss_fraud_dataset.csv")
# Display basic information
print(dim(df))
## [1] 1000000 38

# Preview the first rows
head(df)
## transaction_id customer_id timestamp amount channel
## 1 TXN_F08A86FFD87C CUST_0002AED1 2023-01-14 04:31:09 32266.83 Mobile
## 2 TXN_C2D08134EC83 CUST_0002AED1 2023-01-17 11:20:13 72530.49 Web
## 3 TXN_B9499111901D CUST_0002AED1 2023-01-22 02:17:46 168152.87 Mobile
## 4 TXN_48DB1D526A3B CUST_0002AED1 2023-01-24 08:18:23 16439.93 Mobile
## 5 TXN_56DB1E28B758 CUST_0002AED1 2023-02-01 15:39:53 9922.68 POS
## 6 TXN_8CB46D78CEED CUST_0002AED1 2023-02-08 16:27:19 80685.56 Web
## merchant_category bank location age_group hour day_of_week month
## 1 Grocery Sterling Other 30-39 4 5 1
## 2 Entertainment UBA Other 30-39 11 1 1
## 3 Transport Wema Other 30-39 2 6 1
## 4 Entertainment FCMB Other 30-39 8 1 1
## 5 Education FirstBank Other 30-39 15 2 2
## 6 Restaurant GTBank Other 30-39 16 2 2
## is_weekend is_peak_hour tx_count_24h amount_sum_24h amount_mean_7d
## 1 True False 1 32266.83 32266.83
## 2 False True 1 72530.49 52398.66
## 3 True False 1 168152.87 120341.68
## 4 False False 1 16439.93 85707.76
## 5 False True 1 9922.68 9922.68
## 6 False True 1 80685.56 80685.56
## amount_std_7d tx_count_total amount_mean_total amount_std_total
## 1 0.00 107 170389.9 365915.9
## 2 20131.83 107 170389.9 365915.9
## 3 47811.19 107 170389.9 365915.9
## 4 62633.51 107 170389.9 365915.9
## 5 0.00 107 170389.9 365915.9
## 6 0.00 107 170389.9 365915.9
## channel_diversity location_diversity amount_vs_mean_ratio
## 1 5 1 0.18936948
## 2 5 1 0.42567123
## 3 5 1 0.98686550
## 4 5 1 0.09648363
## 5 5 1 0.05823481
## 6 5 1 0.47353218
## online_channel_ratio is_fraud fraud_technique hour_sin hour_cos
## 1 0.7757009 0 0.8660254 0.5000000
## 2 0.7757009 0 0.2588190 -0.9659258
## 3 0.7757009 0 0.5000000 0.8660254
## 4 0.7757009 0 0.8660254 -0.5000000
## 5 0.7757009 0 -0.7071068 -0.7071068
## 6 0.7757009 0 -0.8660254 -0.5000000
## day_sin day_cos month_sin month_cos amount_log amount_rounded
## 1 -0.9749279 -0.2225209 0.5000000 0.8660254 10.381826 0
## 2 0.7818315 0.6234898 0.5000000 0.8660254 11.191776 0
## 3 -0.7818315 0.6234898 0.5000000 0.8660254 12.032635 0
## 4 0.7818315 0.6234898 0.5000000 0.8660254 9.707529 0
## 5 0.9749279 -0.2225209 0.8660254 0.5000000 9.202679 0
## 6 0.9749279 -0.2225209 0.8660254 0.5000000 11.298327 0
## velocity_score merchant_risk_score composite_risk
## 1 0.18936948 0.2149999 0.07055978
## 2 0.42567123 0.8774244 0.27684880
## 3 0.98686550 0.4402304 0.16364883
## 4 0.09648363 0.8774244 0.26631480
## 5 0.05823481 0.2312907 0.07125073
## 6 0.47353218 0.6084928 0.19770088
# Inspect the structure of all variables
str(df)
## 'data.frame': 1000000 obs. of 38 variables:
## $ transaction_id : chr "TXN_F08A86FFD87C" "TXN_C2D08134EC83" "TXN_B9499111901D" "TXN_48DB1D526A3B" ...
## $ customer_id : chr "CUST_0002AED1" "CUST_0002AED1" "CUST_0002AED1" "CUST_0002AED1" ...
## $ timestamp : chr "2023-01-14 04:31:09" "2023-01-17 11:20:13" "2023-01-22 02:17:46" "2023-01-24 08:18:23" ...
## $ amount : num 32267 72530 168153 16440 9923 ...
## $ channel : chr "Mobile" "Web" "Mobile" "Mobile" ...
## $ merchant_category : chr "Grocery" "Entertainment" "Transport" "Entertainment" ...
## $ bank : chr "Sterling" "UBA" "Wema" "FCMB" ...
## $ location : chr "Other" "Other" "Other" "Other" ...
## $ age_group : chr "30-39" "30-39" "30-39" "30-39" ...
## $ hour : int 4 11 2 8 15 16 13 16 10 19 ...
## $ day_of_week : int 5 1 6 1 2 2 2 2 3 0 ...
## $ month : int 1 1 1 1 2 2 2 2 2 2 ...
## $ is_weekend : chr "True" "False" "True" "False" ...
## $ is_peak_hour : chr "False" "True" "False" "False" ...
## $ tx_count_24h : num 1 1 1 1 1 1 1 2 3 1 ...
## $ amount_sum_24h : num 32267 72530 168153 16440 9923 ...
## $ amount_mean_7d : num 32267 52399 120342 85708 9923 ...
## $ amount_std_7d : num 0 20132 47811 62634 0 ...
## $ tx_count_total : int 107 107 107 107 107 107 107 107 107 107 ...
## $ amount_mean_total : num 170390 170390 170390 170390 170390 ...
## $ amount_std_total : num 365916 365916 365916 365916 365916 ...
## $ channel_diversity : int 5 5 5 5 5 5 5 5 5 5 ...
## $ location_diversity : int 1 1 1 1 1 1 1 1 1 1 ...
## $ amount_vs_mean_ratio: num 0.1894 0.4257 0.9869 0.0965 0.0582 ...
## $ online_channel_ratio: num 0.776 0.776 0.776 0.776 0.776 ...
## $ is_fraud : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fraud_technique : chr "" "" "" "" ...
## $ hour_sin : num 0.866 0.259 0.5 0.866 -0.707 ...
## $ hour_cos : num 0.5 -0.966 0.866 -0.5 -0.707 ...
## $ day_sin : num -0.975 0.782 -0.782 0.782 0.975 ...
## $ day_cos : num -0.223 0.623 0.623 0.623 -0.223 ...
## $ month_sin : num 0.5 0.5 0.5 0.5 0.866 ...
## $ month_cos : num 0.866 0.866 0.866 0.866 0.5 ...
## $ amount_log : num 10.38 11.19 12.03 9.71 9.2 ...
## $ amount_rounded : int 0 0 0 0 0 0 0 0 0 0 ...
## $ velocity_score : num 0.1894 0.4257 0.9869 0.0965 0.0582 ...
## $ merchant_risk_score : num 0.215 0.877 0.44 0.877 0.231 ...
## $ composite_risk : num 0.0706 0.2768 0.1636 0.2663 0.0713 ...
Dataset Description: This dataset was sourced from Kaggle and is a meticulously crafted synthetic dataset containing 1,000,000 financial transactions. It is specifically calibrated to reflect real Nigerian banking patterns using official NIBSS (Nigerian Interbank Settlement System) 2023 fraud landscape statistics.
Data Dictionary (Key Features): Here is a breakdown of the key columns we are working with:
transaction_id: Unique identifier for each transaction.
customer_id: Unique identifier for the account holder.
amount: The value of the transaction in Naira (₦).
timestamp: Date and time the transaction occurred.
channel: The platform used (e.g., Mobile, POS, ATM, Web).
bank: The financial institution processing the transaction.
merchant_category: The industry of the recipient (e.g., Retail, Utilities).
is_fraud: Target variable; 1 if fraud, 0 if legitimate.
velocity_score: Calculated metric indicating the speed of transactions.
merchant_risk_score: Pre-calculated risk level of the merchant.
composite_risk: Aggregated risk score combining multiple factors.
tx_count_24h: Number of transactions by this customer in the last 24 hours.
amount_vs_mean_ratio: How much this amount differs from the customer's average.
fraud_technique: The method used for fraud (e.g., Phishing).
location: Geographic location of the transaction.
age_group: Demographic segment of the customer.
This dataset contains 1,000,000 transactions with 38 features. This substantial volume gives the model enough examples to learn the complex patterns of normal transaction behavior. Each row represents a single transaction with details including amount, timestamp, channel, bank, customer behavior metrics, and risk indicators.
Machine learning algorithms require clean, properly formatted data. Missing values or incorrect data types can cause the algorithm to fail or produce unreliable results.
# Check for missing values in each column
missing_counts <- colSums(is.na(df))
print(missing_counts[missing_counts > 0])
## named numeric(0)
# Total missing values
total_missing <- sum(is.na(df))
cat("\nTotal missing values:", total_missing, "\n")##
## Total missing values: 0
Result: The dataset is complete with no missing values. This is excellent news: we can proceed without imputation strategies that might introduce bias.
# Convert timestamp from character to datetime format
df$timestamp <- as.POSIXct(df$timestamp, format = "%Y-%m-%d %H:%M:%S")
# Verify the conversion
cat("Timestamp class:", class(df$timestamp), "\n")## Timestamp class: POSIXct POSIXt
## Sample timestamps:
## [1] "2023-01-14 04:31:09 WAT" "2023-01-17 11:20:13 WAT"
## [3] "2023-01-22 02:17:46 WAT" "2023-01-24 08:18:23 WAT"
## [5] "2023-02-01 15:39:53 WAT" "2023-02-08 16:27:19 WAT"
## POSIXct[1:1000000], format: "2023-01-14 04:31:09" "2023-01-17 11:20:13" "2023-01-22 02:17:46" ...
Result: The timestamp column is now properly formatted as a POSIXct datetime object. This allows us to extract time-based features and analyze temporal patterns in fraud.
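As an aside, here is a minimal sketch of the kind of time-based features the POSIXct column makes possible (the derived names below are illustrative; the dataset already ships with hour and day_of_week columns):
# Derive hour-of-day and day-of-week from the timestamp (illustrative names)
hour_derived <- as.integer(format(df$timestamp, "%H"))
wday_derived <- as.integer(format(df$timestamp, "%u"))  # 1 = Monday ... 7 = Sunday
head(data.frame(timestamp = df$timestamp, hour_derived, wday_derived))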
Before building the model, it is important to understand the data's characteristics. In unsupervised learning, the main interest is identifying potential outliers and understanding the distribution of key features.
# Calculate fraud statistics
fraud_table <- table(df$is_fraud)
fraud_pct <- prop.table(fraud_table) * 100
cat("Fraud Distribution:\n")## Fraud Distribution:
##
## 0 1
## 997000 3000
##
## Percentages:
##
## 0 1
## 99.7 0.3
# Visualize
ggplot(df, aes(x = factor(is_fraud), fill = factor(is_fraud))) +
geom_bar() +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "Distribution of Transactions by Fraud Status",
x = "Transaction Type",
y = "Count",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "top")**Result: ** Fraud represents a small percentage of total transactions, which is typical in fraud detection scenarios. This is why unsupervised learning is valuable we can’t rely solely on labeled examples when fraud is so rare.
Before using the log transformation on the amount column, I want to see how the data is distributed.
What is a log transformation?
Log transformation is a mathematical technique used to compress data that spans a huge range (like transaction amounts) so that small values and massive values can be visualized together on the same scale.
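A quick numeric illustration (toy amounts, not drawn from the dataset) shows how the log compresses the range:
# Three toy amounts spanning four orders of magnitude
amounts <- c(500, 50000, 5000000)
log1p(amounts)  # natural log of (1 + amount) -> roughly 6.2, 10.8, 15.4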
# Visualize raw transaction amounts (No Log Scale)
ggplot(df, aes(x = amount)) +
geom_histogram(bins = 50, fill = "#E74C3C", alpha = 0.7) +
scale_x_continuous(labels = scales::comma) +
labs(
title = "Raw Distribution of Transaction Amounts",
subtitle = "Notice how 'skewed' the data looks without scaling",
x = "Amount (₦)",
y = "Count"
) +
  theme_minimal()

Interpretation: From the visualization, this plot is heavily right-skewed.
Almost all transactions are crammed into the single tall bar on the left (small amounts).
The "outliers" (million-naira transactions) are so spread out on the right that they are invisible.
Problem: We cannot see the pattern of normal user behavior here.
Solution: This is why we use the log scale in the next step (3.2) to "unpack" that tall bar and reveal the bell curve hidden inside.
# Visualize transaction amounts
ggplot(df, aes(x = amount)) +
geom_histogram(bins = 50, fill = "#3498DB", alpha = 0.7) +
scale_x_log10(labels = scales::comma) +
labs(
title = "Distribution of Transaction Amounts (Log Scale)",
x = "Amount (₦)",
y = "Count"
) +
  theme_minimal()
##
## Amount Statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.686e+02 2.800e+04 6.668e+04 1.570e+05 1.595e+05 1.793e+07
What this plot is saying: This histogram shows the Distribution of Transaction Amounts on a Log Scale.
Analysis: Transaction amounts vary widely, from small everyday purchases to large transfers. The log scale visualization reveals that most transactions cluster in the lower range, with outliers representing unusually large amounts. These extreme values are prime candidates for anomaly detection.
# Analyze fraud rate across different channels
channel_fraud <- df %>%
group_by(channel) %>%
summarise(
total_tx = n(),
fraud_tx = sum(is_fraud),
fraud_rate = fraud_tx/3000 * 100  # each channel's share of all 3,000 fraud cases
) %>%
arrange(desc(fraud_rate))
channel_fraud
## # A tibble: 6 × 4
## channel total_tx fraud_tx fraud_rate
## <chr> <int> <int> <dbl>
## 1 Mobile 449522 1496 49.9
## 2 Web 200488 687 22.9
## 3 POS 180035 551 18.4
## 4 IB 99653 168 5.6
## 5 ECOM 50227 76 2.53
## 6 ATM 20075 22 0.733
# Visualize only the Fraud Rate
ggplot(channel_fraud, aes(x = reorder(channel, fraud_rate), y = fraud_rate)) +
geom_bar(stat = "identity", fill = "#E74C3C", alpha = 0.8) +
geom_text(aes(label = sprintf("%.2f%%", fraud_rate)), hjust = -0.1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Add space for labels
labs(
title = "Fraud Rate by Transaction Channel",
subtitle = "Which channels are most vulnerable to attacks?",
x = "Channel",
y = "Fraud Rate (%)"
) +
theme_minimal() +
  coord_flip()

What this plot tells us: This chart ranks which banking platforms are most attractive to fraudsters. Mobile and Web dominate, together accounting for over 70% of all fraud cases (49.9% and 22.9% respectively).
# Extract hour for visualization
df_temp <- df %>%
mutate(hour = as.numeric(format(timestamp, "%H")))
# Calculate fraud rate by hour
hourly_fraud <- df_temp %>%
group_by(hour) %>%
summarise(
total = n(),
fraud_count = sum(is_fraud),
fraud_rate = mean(is_fraud) * 100
)
# Visualize fraud rate over the day
ggplot(hourly_fraud, aes(x = hour, y = fraud_rate)) +
geom_line(color = "#E74C3C", size = 1.2) +
geom_point(color = "#C0392B", size = 3) +
scale_x_continuous(breaks = 0:23) +
labs(
title = "Hourly Fraud Trends",
subtitle = "Are attacks more common at night?",
x = "Hour of Day (0-23)",
y = "Fraud Rate (%)"
) +
theme_minimal() +
  theme(panel.grid.minor = element_blank())

What this plot tells us: This chart tracks the "fraud schedule" over a 24-hour cycle.
One of the strongest indicators of fraud is “velocity” how fast transactions are occurring. Fraudsters often try to drain an account quickly before being blocked.
# Compare velocity scores
ggplot(df, aes(x = factor(is_fraud), y = velocity_score, fill = factor(is_fraud))) +
geom_boxplot(outlier.alpha = 0.3) +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
scale_y_log10() + # Use log scale because velocity varies widely
labs(
title = "Transaction Velocity Distribution",
subtitle = "Do fraudsters transact faster than normal users?",
x = "Status",
y = "Velocity Score (Log Scale)",
fill = "Status"
) +
  theme_minimal()

What this plot tells us:
- They look alike: Contrary to expectation, the median velocity (the line inside the box) for fraudsters is strikingly similar to that of legitimate users.
- Why? Sophisticated fraudsters know that speed kills their chances of success, so they intentionally slow down their attacks to mimic normal human behavior and bypass simple velocity rules.
- Conclusion: We cannot catch these advanced fraudsters just by measuring speed. This confirms why we need a more complex algorithm like Isolation Forest: to find the subtle combinations of anomalies (e.g., normal speed + odd time + huge amount) that simple rules miss.
Are certain banks targeted more frequently? This helps identify if specific institutions have security loopholes being exploited.
# Calculate fraud rate by bank (Top 10 banks by volume)
bank_stats <- df %>%
group_by(bank) %>%
summarise(
volume = n(),
fraud_count = sum(is_fraud),
fraud_rate = fraud_count / volume * 100  # per-bank fraud rate (fraud count as a share of that bank's volume)
) %>%
arrange(desc(volume)) %>%
slice_head(n = 10)  # keep the 10 highest-volume banks, as the title states
ggplot(bank_stats, aes(x = reorder(bank, fraud_rate), y = fraud_rate)) +
geom_bar(stat = "identity", fill = "#8E44AD", alpha = 0.8) +
coord_flip() +
labs(
title = "Fraud Rate by Bank (Top 10 by Volume)",
subtitle = "Which institutions are being targeted?",
x = "Bank",
y = "Fraud Rate (%)"
) +
theme_minimal()
Interpretation:
The chart reveals a uniform fraud rate of ~0.3% across all top banks, indicating that attackers are targeting the entire banking infrastructure equally rather than exploiting a specific institution’s weakness.
There are several “Risk Scores” in the data (Velocity, Merchant Risk, Composite Risk). This analysis checks if they are telling the same story.
What these risk scores actually mean:
Merchant Risk Score (the "shop" score): This measures how sketchy the recipient is. If you buy from a verified shop (like Jumia), this score is low. If you transfer money to a gambling site or a brand-new, unverified website, this score is high.
Composite Risk (the "overall" score): This is the bank's existing alarm system, a single number that tries to sum up everything (velocity + merchant risk + location + device ID). Think of it like a student's GPA: it combines Math, English, and Science into one grade.
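The exact formula behind composite_risk is not documented in the dataset, but conceptually it behaves like a weighted blend of the individual signals. A purely hypothetical sketch (the weights and the location_risk input are illustrative, not the data provider's actual formula):
# Hypothetical recombination of risk signals into one score (illustrative weights only)
composite_sketch <- function(velocity, merchant_risk, location_risk,
                             w = c(0.4, 0.4, 0.2)) {
  sum(w * c(velocity, merchant_risk, location_risk))
}
composite_sketch(velocity = 0.42, merchant_risk = 0.88, location_risk = 0.10)  # ~0.54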
# Select only the numerical risk scores
risk_data <- df %>%
select(velocity_score, merchant_risk_score, composite_risk, amount)
# Calculate the correlation matrix
cor_matrix <- cor(risk_data)
# Reshape for plotting (making it tidy)
cor_melted <- as.data.frame(as.table(cor_matrix))
# Plot Heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
geom_text(aes(label = round(Freq, 2)), color = "white", size = 4) +
scale_fill_gradient2(
low = "#3498DB", high = "#E74C3C", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation"
) +
labs(
title = "Correlation Heatmap of Risk Factors",
subtitle = "1.0 = Perfect Match, 0.0 = No Relationship",
x = "", y = ""
) +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key observation: The heatmap shows how strongly each risk factor moves with the others (1.0 = a perfect match, 0.0 = no relationship).
The next chart compares the bank's existing composite risk score (x-axis) against the transaction amount (y-axis).
# We take a sample to avoid overplotting
set.seed(123)
sample_df <- df %>% sample_n(10000)
ggplot(sample_df, aes(x = composite_risk, y = amount, color = factor(is_fraud))) +
geom_point(alpha = 0.6, size = 2) +
scale_y_log10(labels = scales::comma) + # Log scale for amount
scale_color_manual(
values = c("#BDC3C7", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "The Blind Spots: Amount vs. Risk Score",
subtitle = "Red dots on the LEFT are frauds that bypassed the current risk system",
x = "Composite Risk Score (Current System)",
y = "Transaction Amount (Log Scale)",
color = "Status"
) +
theme_minimal() +
geom_vline(xintercept = 0.5, linetype = "dashed", color = "black") +
annotate("text", x = 0.2, y = max(sample_df$amount), label = "Low Risk Zone", hjust = 0.5) +
annotate("text", x = 0.8, y = max(sample_df$amount), label = "High Risk Zone", hjust = 0.5)Insight: - The Dotted Line (Threshold): The vertical line at 0.5 represents the typical cutoff for flagging a transaction. Anything to the right is “Risky”, anything to the left is “Safe”. - The Critical Finding: Notice the significant cluster of Red Dots (Fraud) on the Left Side of the dotted line. - Interpretation: These are high-value fraudulent transactions (some exceeding ₦1,000,000) that have Low Risk Scores (0.0 - 0.2). - Conclusion: The current rule-based risk system is failing. It consistently misclassifies these high-value frauds as “Low Risk,” allowing them to pass through undetected. This failure demonstrates exactly why a more sophisticated model like Isolation Forest is required.
Understanding who is being targeted and how they are being attacked is crucial for tailored prevention.
# 1. Analyze Age Group Vulnerability
age_fraud <- df %>%
group_by(age_group) %>%
summarise(
total = n(),
fraud_count = sum(is_fraud),
fraud_rate = fraud_count/3000 * 100  # each age group's share of all 3,000 fraud cases
)
p1 <- ggplot(age_fraud, aes(x = age_group, y = fraud_rate, fill = age_group)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = sprintf("%.2f%%", fraud_rate)), vjust = -0.5) +
labs(
title = "Fraud Rate by Age Group",
subtitle = "Who is the primary target?",
x = "Age Group",
y = "Fraud Rate (%)"
) +
theme_minimal() +
theme(legend.position = "none")
# 2. Analyze Fraud Techniques
technique_fraud <- df %>%
filter(is_fraud == 1) %>%
count(fraud_technique) %>%
mutate(perc = n / sum(n) * 100)
p2 <- ggplot(technique_fraud, aes(x = reorder(fraud_technique, n), y = n, fill = fraud_technique)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = n), hjust = 0.5) +
labs(
title = "Most Common Fraud Techniques",
subtitle = "How are they stealing the money?",
x = "Technique",
y = "Number of Incidents"
) +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
# Print plots
print(p1)
print(p2)

What these plots tell us (the human factor): The first chart shows which age groups account for the largest share of fraud, and the second shows which techniques fraudsters rely on most.
Why this step matters: The Isolation Forest algorithm requires numerical inputs to build its random partitions, so we need to transform the data into a purely numerical matrix, extracting meaningful features and encoding categorical variables.
# Prepare features for the model
# CRITICAL: We exclude 'is_fraud' because this is unsupervised learning
# The robot must learn to detect fraud without being told what fraud looks like
# We select the most relevant features for detection
features_to_use <- df %>%
select(
amount, hour, day_of_week, channel, bank,
velocity_score, merchant_risk_score, composite_risk,
tx_count_24h, amount_vs_mean_ratio
) %>%
mutate(across(where(is.character), as.factor))
# Convert to numerical matrix (one-hot encoding for categorical variables)
# This converts "Bank A" into a column of 1s and 0s
model_data <- model.matrix(~ . - 1, data = features_to_use)
head(model_data)
## amount hour day_of_week channelATM channelECOM channelIB channelMobile
## 1 32266.83 4 5 0 0 0 1
## 2 72530.49 11 1 0 0 0 0
## 3 168152.87 2 6 0 0 0 1
## 4 16439.93 8 1 0 0 0 1
## 5 9922.68 15 2 0 0 0 0
## 6 80685.56 16 2 0 0 0 0
## channelPOS channelWeb bankFCMB bankFidelity bankFirstBank bankGTBank
## 1 0 0 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 1 0 0 0 1 0
## 6 0 1 0 0 0 1
## bankSterling bankUBA bankUnion bankWema bankZenith velocity_score
## 1 1 0 0 0 0 0.18936948
## 2 0 1 0 0 0 0.42567123
## 3 0 0 0 1 0 0.98686550
## 4 0 0 0 0 0 0.09648363
## 5 0 0 0 0 0 0.05823481
## 6 0 0 0 0 0 0.47353218
## merchant_risk_score composite_risk tx_count_24h amount_vs_mean_ratio
## 1 0.2149999 0.07055978 1 0.18936948
## 2 0.8774244 0.27684880 1 0.42567123
## 3 0.4402304 0.16364883 1 0.98686550
## 4 0.8774244 0.26631480 1 0.09648363
## 5 0.2312907 0.07125073 1 0.05823481
## 6 0.6084928 0.19770088 1 0.47353218
## Feature matrix dimensions: 1000000 23
## Features used: 23
Result & Rationale (Why These Features?):
We transformed the raw dataset into a numerical feature matrix, carefully selecting 10 high-signal features based on the EDA findings:
- velocity_score & tx_count_24h: capture speed (fraudsters often move fast).
- amount_vs_mean_ratio: captures deviation (is this transaction 10x your normal spend?).
- merchant_risk_score & composite_risk: the "smoke detectors".
- hour & day_of_week: capture timing risk (the "witching hour" from Section 3.4).
- channel: captures vulnerability (Section 3.3 showed Mobile/Web are high risk).
- amount: the core financial value.
Note: We excluded identifying columns like transaction_id and customer_id because unique identifiers do not contain predictive patterns.
Why this step matters: This is the core of this project. Unlike supervised models that memorize “Fraud vs. Safe” examples, this unsupervised model learns the geometry of normal behavior. It builds a mathematical boundary around the “normal” data points.
How the Training Works:
1. Random partitioning: The algorithm builds 100 different decision trees (ntrees = 100).
2. Slicing the data: Each tree randomly selects a feature (e.g., amount) and a split value to slice the data.
3. The "isolation" principle:
- Normal transactions are clustered together; it takes many cuts to isolate a single normal point from the herd.
- Anomalies (fraud) are distinct and "far away" in the feature space; it takes very few cuts to isolate them.
- Therefore, path length (the number of cuts) becomes our proxy for the anomaly score.
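To make the path-length intuition concrete, here is a tiny self-contained sketch on toy data (not the NIBSS features): an obvious outlier is isolated quickly and therefore receives a clearly higher isotree score than points inside the main cluster.
library(isotree)
set.seed(42)
# 500 normal points plus one far-away outlier
toy <- data.frame(x = c(rnorm(500), 10), y = c(rnorm(500), 10))
toy_model <- isolation.forest(toy, ntrees = 50)
toy_scores <- predict(toy_model, toy)
toy_scores[501]          # the outlier: few cuts needed -> score close to 1
mean(toy_scores[1:500])  # the cluster: many cuts needed -> scores well below the outlier's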
# Train the Isolation Forest model
# ntrees = 100: We build an ensemble of 100 trees to ensure robustness (like a jury of 100 voters)
# sample_size = 256: The optimal subset size to detect anomalies without overfitting
cat("Training the Anomaly Detection Model...\n")## Training the Anomaly Detection Model...
iso_model <- isolation.forest(model_data, ntrees = 100, sample_size = 256)
# Save the trained model
saveRDS(iso_model, "anomaly_detection_model.rds")
cat("✓ Model trained and saved as 'anomaly_detection_model.rds'\n")## ✓ Model trained and saved as 'anomaly_detection_model.rds'
Result: The anomaly detection model has been successfully trained on the feature matrix derived from 1,000,000 transactions. It has learned the statistical structure of "normalcy" and can now score any new transaction based on how much it deviates from this learned structure. The model is saved as an .rds file for deployment in our Shiny application.
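As a deployment sketch (illustrative, not the production Shiny code), the saved .rds file can be loaded once at app startup and fresh transactions scored with the same preparation pipeline used for training; "new_transactions.csv" is a hypothetical input file, and the dummy columns produced by model.matrix must line up with the 23 training columns.
# Illustrative scoring pipeline for new data (e.g., inside the Shiny app's server function)
library(dplyr)
library(isotree)

iso_model <- readRDS("anomaly_detection_model.rds")
new_df <- read.csv("new_transactions.csv")  # hypothetical file with the same columns as training

new_features <- new_df %>%
  select(amount, hour, day_of_week, channel, bank,
         velocity_score, merchant_risk_score, composite_risk,
         tx_count_24h, amount_vs_mean_ratio) %>%
  mutate(across(where(is.character), as.factor))

# Note: factor levels (and hence dummy columns) must match those seen during training
new_matrix <- model.matrix(~ . - 1, data = new_features)
new_df$anomaly_score <- predict(iso_model, new_matrix)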
Why this step matters: The Model now examines every transaction and assigns an anomaly score. This score represents how “unusual” each transaction appears compared to the learned normal behavior.
# 1. SCORING: The Model examines every transaction
# predict() takes the trained 'iso_model' and the feature matrix 'model_data'
# It assigns a score (0.0 to 1.0) to each of the 1,000,000 rows
scores <- predict(iso_model, model_data)
# 2. ASSIGNMENT: We save these scores back into our main dataframe
df$anomaly_score <- scores
# 3. VISUALIZATION: Checking the "Shape of Risk"
# We plot a histogram to see where most transactions land.
# Expectation: A tall peak around 0.45 (Normal) and a long tail to the right (Anomalies).
ggplot(df, aes(x = anomaly_score)) +
geom_histogram(bins = 50, fill = "#E74C3C", alpha = 0.7) +
labs(
title = "Distribution of Anomaly Scores",
subtitle = "Higher scores indicate higher likelihood of fraud",
x = "Anomaly Score",
y = "Count"
) +
  theme_minimal()

# 4. STATISTICS: Establishing the Baseline
# Min/Mean/Max help us define what is "High Risk."
cat("\nAnomaly Score Statistics:\n")##
## Anomaly Score Statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3502 0.4182 0.4400 0.4432 0.4634 0.6926
Interpretation of Results (What the Plot Shows):
- The "normal" hill: There is a massive bell curve centered around 0.40 - 0.45. This represents the 99%+ of customers who are legitimate.
- The "danger" tail: To the far right (scores > 0.60), the bars get very short. These are the rare, isolated anomalies.
- The strategy: Our job is to draw a line (a threshold) to separate the hill from the tail. Anything to the right of that line gets blocked.
What the scores mean:
- Scores around 0.4 - 0.5: typical, normal transactions (this is where the bulk of the data sits).
- Scores approaching 1.0: highly anomalous (potential fraud).
The histogram shows most transactions cluster at lower scores (normal behavior), with a tail extending toward higher scores (potential anomalies).
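One practical way to act on these scores (an illustrative convention, not something prescribed by isotree) is to bucket them into risk bands before they reach an investigator; the 0.50 and 0.60 cutoffs below echo the "hill vs. tail" reading of the histogram above.
# Illustrative risk bands derived from the empirical score distribution
df$risk_band <- cut(
  df$anomaly_score,
  breaks = c(-Inf, 0.50, 0.60, Inf),
  labels = c("Low", "Review", "Block")
)
table(df$risk_band)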
Why this step matters: In a real-world unsupervised scenario, there won't be labels to check against. But for this research we do have them, which lets us "grade" the anomaly detection model. We want to verify that the model assigns significantly higher scores to actual fraud than to legitimate transactions.
# 1. GROUPING: Aggregating data by Fraud Status (0 or 1)
# We want to see the difference in 'Average Score' between the two groups.
avg_scores <- df %>%
group_by(is_fraud) %>%
summarise(
mean_score = mean(anomaly_score),
median_score = median(anomaly_score),
sd_score = sd(anomaly_score),
count = n()
)
cat("Average Anomaly Scores by Fraud Status:\n")## Average Anomaly Scores by Fraud Status:
## # A tibble: 2 × 5
## is_fraud mean_score median_score sd_score count
## <int> <dbl> <dbl> <dbl> <int>
## 1 0 0.443 0.440 0.0388 997000
## 2 1 0.446 0.444 0.0372 3000
# 2. VISUALIZATION: Boxplot Comparison
# Boxplots are perfect here because they show the 'Range' of scores.
# If the model is working, the Red Box (Fraud) should be visibly higher up than the Green Box (Legit).
ggplot(df, aes(x = factor(is_fraud), y = anomaly_score, fill = factor(is_fraud))) +
geom_boxplot() +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "Anomaly Scores: Legitimate vs Fraudulent Transactions",
subtitle = "Do fraudsters get higher scores than regular users?",
x = "Transaction Type (0 = Legitimate, 1 = Fraudulent)",
y = "Anomaly Score",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "top")Visual Analysis of the Boxplot (The Truth):
# 1. SETTING THE THRESHOLD
# We decide that the "Top 5%" most suspicious transactions are worth blocking.
# quantile(0.95) finds the score that separates the bottom 95% from the top 5%.
threshold <- quantile(df$anomaly_score, 0.95)
# 2. MAKING PREDICTIONS
# If Score > Threshold -> Predict Fraud (1)
# Else -> Predict Normal (0)
df$predicted_fraud <- ifelse(df$anomaly_score > threshold, 1, 0)
# 3. CONFUSION MATRIX
# This table compares our "Predictions" vs the "Truth".
# It tells us:
# - How many frauds we caught (True Positives)
# - How many innocent people we blocked (False Positives)
conf_matrix <- table(Predicted = df$predicted_fraud, Actual = df$is_fraud)
cat("\nConfusion Matrix:\n")##
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 947169 2831
## 1 49831 169
# 4. CALCULATING SCORES
# We break down the matrix into 4 key numbers:
TP <- conf_matrix[2, 2] # True Positives (Hit)
TN <- conf_matrix[1, 1] # True Negatives (Correctly Ignored)
FP <- conf_matrix[2, 1] # False Positives (False Alarm)
FN <- conf_matrix[1, 2] # False Negatives (Miss)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("\nPerformance Metrics:\n")##
## Performance Metrics:
## Precision: 0.0034
## Recall: 0.0563
## F1-Score: 0.0064
Interpretation:
- 169 real frauds caught: The model successfully caught 169 fraudulent transactions (bottom right of the matrix). These are attacks that were stopped.
- 2,831 frauds escaped: The model missed 2,831 frauds (top right). It predicted "safe" (0), but they were actually fraud (1). Why? These frauds looked too normal for the anomaly detector to notice.
- 49,831 innocent customers blocked: The model flagged 49,831 legitimate transactions as fraud (bottom left).
Conclusion: The model is far from perfect. It catches some fraud (169 cases), but at the cost of inconveniencing tens of thousands of customers (about 50,000 false alarms). This proves that the Isolation Forest model alone is not enough to catch these fraudsters.
What these Scorecard Numbers Mean:
These scores are extremely low.
Precision (0.0034): This means 0.34%. In simple English: For every 1,000 times the model yells “FRAUD!”, it is wrong 997 times. It is almost purely guessing.
Recall (0.0563): This means 5.6%. In simple English: Out of all the actual criminals, the model only caught 5% of them. It let 95% get away.
F1-Score (0.0064): This is the overall grade: 0.6%. Basically, the model failed.
Why did it fail so badly? It is not because the algorithm is broken; it is a property of the data. Fraudulent transactions in this dataset are NOT anomalies: they look almost exactly like normal transactions (similar amounts, similar speed, similar timing). So an anomaly detector, which hunts for weirdness, is not the right tool on its own. It is like trying to find a specific grain of sand on a beach by looking for "the weirdest grain": they all look like sand.
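A tiny simulation (toy data, not the NIBSS records) illustrates the point: when "fraud" rows are drawn from the same distribution as legitimate ones, Isolation Forest assigns them essentially the same scores.
library(isotree)
set.seed(1)
legit <- data.frame(amount = rlnorm(5000, meanlog = 10, sdlog = 1))
mimic <- data.frame(amount = rlnorm(50, meanlog = 10, sdlog = 1))  # "fraud" copying normal behaviour
m <- isolation.forest(legit, ntrees = 50)
mean(predict(m, legit))  # baseline score for normal transactions
mean(predict(m, mimic))  # nearly identical: mimicking fraud is not "isolated"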
Conclusion
The most critical finding of this study is that modern digital fraud does not always look like an anomaly. The Isolation Forest analysis showed that fraudulent transactions are statistically nearly identical to legitimate ones (similar amounts, velocity, and timing).
The Unsupervised Anomaly Detection approach failed to accurately distinguish fraud, achieving only 0.34% Precision and 5.6% Recall. This scientifically demonstrates that relying solely on “deviation from normal” is insufficient for detecting sophisticated banking fraud.
The analysis of attack vectors points to social engineering: fraudsters aren't breaking the system (which would create anomalies); they are hacking the human into authorizing "normal"-looking transfers.
Recommendations
Customer Education & Social‑Engineering Mitigation Since the analysis shows social engineering as the dominant technique, we can launch a phishing‑awareness campaign:
Periodic SMS/email reminders about never sharing OTPs.
In‑app pop‑ups during high‑risk hours warning users of “possible phishing attempts.”
Measure impact by tracking reductions in the “social‑engineering” fraud count after each campaign.
We can still keep the unsupervised model: retain it as a "zero-day" detector for the top 1% of truly unusual anomalies (which supervised models might miss), as sketched below.
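A minimal sketch of that hybrid rule, keeping the unsupervised score only for the most extreme cases (the 1% cutoff is the figure suggested above):
# Flag only the top 1% of anomaly scores as "zero-day" candidates for manual review
zero_day_cutoff <- quantile(df$anomaly_score, 0.99)
df$zero_day_flag <- df$anomaly_score >= zero_day_cutoff
sum(df$zero_day_flag)  # ~10,000 of 1,000,000 transactions routed to investigators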
Future Direction
Deep Learning (Autoencoders/LSTMs): Investigate Sequence Models (LSTMs) that can analyze the sequence of user actions (e.g., “User checked balance 5 times -> changed PIN -> transferred ALL money”). This captures intent better than single-transaction analysis.
Graph Neural Networks (GNNs): Implement Network Analysis to detect Money Mule Rings. Instead of looking at one transaction, look at the web of connections. If Account A sends to Account B, and B sends to C immediately, that is a pattern Isolation Forest can’t see, but Graph Analysis can.
Real-Time Behavioral Biometrics: Integrate data on how the user interacts (typing speed, mouse movements) during the transaction. A fraudster manipulating an account behaves differently than the owner, even if the transaction numbers are identical.