Opayinka Mercy O
Real-time Fraud Detection and Cybersecurity in Digital Payments
With the rapid growth of digital payments (mobile money, online banking) in Nigeria, financial institutions are increasingly vulnerable to sophisticated fraud schemes and cyberattacks. Detecting these in real-time, often with limited historical fraud data, is crucial for maintaining trust and financial security. Traditional methods often fail to catch new, evolving fraud patterns.
Detect Fraud in real time.
Leveraging unsupervised learning algorithms (Isolation Forest) to identify unusual transaction patterns that deviate from normal behavior.
Real‑time fraud scoring via an interactive Shiny app
This project focuses on building an automated defense mechanism against payment fraud. Using the Isolation Forest algorithm, the system acts as a “security robot” that learns the complex patterns of legitimate transactions from the NIBSS dataset (1,000,000 records). Unlike traditional rules-based systems, this model detects anomalies based on deviation rather than fixed rules, allowing it to catch novel and zero-day fraud attacks. The solution is fully implemented as a deployable Shiny application that provides real-time scanning, risk scoring, and interactive analytics for fraud investigators.
# Load the NIBSS fraud dataset
df <- read.csv("nibss_fraud_dataset.csv")
# Display basic information
print(dim(df))
## [1] 1000000 38

# Preview the first rows
head(df)
## transaction_id customer_id timestamp amount channel
## 1 TXN_F08A86FFD87C CUST_0002AED1 2023-01-14 04:31:09 32266.83 Mobile
## 2 TXN_C2D08134EC83 CUST_0002AED1 2023-01-17 11:20:13 72530.49 Web
## 3 TXN_B9499111901D CUST_0002AED1 2023-01-22 02:17:46 168152.87 Mobile
## 4 TXN_48DB1D526A3B CUST_0002AED1 2023-01-24 08:18:23 16439.93 Mobile
## 5 TXN_56DB1E28B758 CUST_0002AED1 2023-02-01 15:39:53 9922.68 POS
## 6 TXN_8CB46D78CEED CUST_0002AED1 2023-02-08 16:27:19 80685.56 Web
## merchant_category bank location age_group hour day_of_week month
## 1 Grocery Sterling Other 30-39 4 5 1
## 2 Entertainment UBA Other 30-39 11 1 1
## 3 Transport Wema Other 30-39 2 6 1
## 4 Entertainment FCMB Other 30-39 8 1 1
## 5 Education FirstBank Other 30-39 15 2 2
## 6 Restaurant GTBank Other 30-39 16 2 2
## is_weekend is_peak_hour tx_count_24h amount_sum_24h amount_mean_7d
## 1 True False 1 32266.83 32266.83
## 2 False True 1 72530.49 52398.66
## 3 True False 1 168152.87 120341.68
## 4 False False 1 16439.93 85707.76
## 5 False True 1 9922.68 9922.68
## 6 False True 1 80685.56 80685.56
## amount_std_7d tx_count_total amount_mean_total amount_std_total
## 1 0.00 107 170389.9 365915.9
## 2 20131.83 107 170389.9 365915.9
## 3 47811.19 107 170389.9 365915.9
## 4 62633.51 107 170389.9 365915.9
## 5 0.00 107 170389.9 365915.9
## 6 0.00 107 170389.9 365915.9
## channel_diversity location_diversity amount_vs_mean_ratio
## 1 5 1 0.18936948
## 2 5 1 0.42567123
## 3 5 1 0.98686550
## 4 5 1 0.09648363
## 5 5 1 0.05823481
## 6 5 1 0.47353218
## online_channel_ratio is_fraud fraud_technique hour_sin hour_cos
## 1 0.7757009 0 0.8660254 0.5000000
## 2 0.7757009 0 0.2588190 -0.9659258
## 3 0.7757009 0 0.5000000 0.8660254
## 4 0.7757009 0 0.8660254 -0.5000000
## 5 0.7757009 0 -0.7071068 -0.7071068
## 6 0.7757009 0 -0.8660254 -0.5000000
## day_sin day_cos month_sin month_cos amount_log amount_rounded
## 1 -0.9749279 -0.2225209 0.5000000 0.8660254 10.381826 0
## 2 0.7818315 0.6234898 0.5000000 0.8660254 11.191776 0
## 3 -0.7818315 0.6234898 0.5000000 0.8660254 12.032635 0
## 4 0.7818315 0.6234898 0.5000000 0.8660254 9.707529 0
## 5 0.9749279 -0.2225209 0.8660254 0.5000000 9.202679 0
## 6 0.9749279 -0.2225209 0.8660254 0.5000000 11.298327 0
## velocity_score merchant_risk_score composite_risk
## 1 0.18936948 0.2149999 0.07055978
## 2 0.42567123 0.8774244 0.27684880
## 3 0.98686550 0.4402304 0.16364883
## 4 0.09648363 0.8774244 0.26631480
## 5 0.05823481 0.2312907 0.07125073
## 6 0.47353218 0.6084928 0.19770088
# Inspect the structure of all variables
str(df)
## 'data.frame': 1000000 obs. of 38 variables:
## $ transaction_id : chr "TXN_F08A86FFD87C" "TXN_C2D08134EC83" "TXN_B9499111901D" "TXN_48DB1D526A3B" ...
## $ customer_id : chr "CUST_0002AED1" "CUST_0002AED1" "CUST_0002AED1" "CUST_0002AED1" ...
## $ timestamp : chr "2023-01-14 04:31:09" "2023-01-17 11:20:13" "2023-01-22 02:17:46" "2023-01-24 08:18:23" ...
## $ amount : num 32267 72530 168153 16440 9923 ...
## $ channel : chr "Mobile" "Web" "Mobile" "Mobile" ...
## $ merchant_category : chr "Grocery" "Entertainment" "Transport" "Entertainment" ...
## $ bank : chr "Sterling" "UBA" "Wema" "FCMB" ...
## $ location : chr "Other" "Other" "Other" "Other" ...
## $ age_group : chr "30-39" "30-39" "30-39" "30-39" ...
## $ hour : int 4 11 2 8 15 16 13 16 10 19 ...
## $ day_of_week : int 5 1 6 1 2 2 2 2 3 0 ...
## $ month : int 1 1 1 1 2 2 2 2 2 2 ...
## $ is_weekend : chr "True" "False" "True" "False" ...
## $ is_peak_hour : chr "False" "True" "False" "False" ...
## $ tx_count_24h : num 1 1 1 1 1 1 1 2 3 1 ...
## $ amount_sum_24h : num 32267 72530 168153 16440 9923 ...
## $ amount_mean_7d : num 32267 52399 120342 85708 9923 ...
## $ amount_std_7d : num 0 20132 47811 62634 0 ...
## $ tx_count_total : int 107 107 107 107 107 107 107 107 107 107 ...
## $ amount_mean_total : num 170390 170390 170390 170390 170390 ...
## $ amount_std_total : num 365916 365916 365916 365916 365916 ...
## $ channel_diversity : int 5 5 5 5 5 5 5 5 5 5 ...
## $ location_diversity : int 1 1 1 1 1 1 1 1 1 1 ...
## $ amount_vs_mean_ratio: num 0.1894 0.4257 0.9869 0.0965 0.0582 ...
## $ online_channel_ratio: num 0.776 0.776 0.776 0.776 0.776 ...
## $ is_fraud : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fraud_technique : chr "" "" "" "" ...
## $ hour_sin : num 0.866 0.259 0.5 0.866 -0.707 ...
## $ hour_cos : num 0.5 -0.966 0.866 -0.5 -0.707 ...
## $ day_sin : num -0.975 0.782 -0.782 0.782 0.975 ...
## $ day_cos : num -0.223 0.623 0.623 0.623 -0.223 ...
## $ month_sin : num 0.5 0.5 0.5 0.5 0.866 ...
## $ month_cos : num 0.866 0.866 0.866 0.866 0.5 ...
## $ amount_log : num 10.38 11.19 12.03 9.71 9.2 ...
## $ amount_rounded : int 0 0 0 0 0 0 0 0 0 0 ...
## $ velocity_score : num 0.1894 0.4257 0.9869 0.0965 0.0582 ...
## $ merchant_risk_score : num 0.215 0.877 0.44 0.877 0.231 ...
## $ composite_risk : num 0.0706 0.2768 0.1636 0.2663 0.0713 ...
Dataset Description: This dataset was sourced from Kaggle and is a meticulously crafted synthetic dataset containing 1,000,000 financial transactions. It is specifically calibrated to reflect real Nigerian banking patterns using official NIBSS (Nigerian Interbank Settlement System) 2023 fraud landscape statistics.
Data Dictionary (Key Features): Here is a breakdown of the key columns we are working with:
transaction_id: Unique identifier for each transaction.
customer_id: Unique identifier for the account holder.
amount: The value of the transaction in Naira (₦).
timestamp: Date and time the transaction occurred.
channel: The platform used (e.g., Mobile, POS, ATM, Web).
bank: The financial institution processing the transaction.
merchant_category: The industry of the recipient (e.g., Retail, Utilities).
is_fraud: Target variable; 1 if fraud, 0 if legitimate.
velocity_score: Calculated metric indicating the speed of transactions.
merchant_risk_score: Pre-calculated risk level of the merchant.
composite_risk: Aggregated risk score combining multiple factors.
tx_count_24h: Number of transactions by this customer in the last 24 hours.
amount_vs_mean_ratio: How much this amount differs from the customer's average.
fraud_technique: The method used for fraud (e.g., Phishing).
location: Geographic location of the transaction.
age_group: Demographic segment of the customer.
This dataset contains 1,000,000 transactions with 38 features. This substantial volume gives the model enough examples to learn the complex patterns of normal transaction behavior. Each row represents a single transaction with details including amount, timestamp, channel, bank, customer behavior metrics, and risk indicators.
Machine learning algorithms require clean, properly formatted data. Missing values or incorrect data types can cause the algorithm to fail or produce unreliable results.
# Check for missing values in each column
missing_counts <- colSums(is.na(df))
print(missing_counts[missing_counts > 0])
## named numeric(0)
# Total missing values
total_missing <- sum(is.na(df))
cat("\nTotal missing values:", total_missing, "\n")##
## Total missing values: 0
Result: The dataset is complete with no missing values. This is excellent news: we can proceed without imputation strategies that might introduce bias.
# Convert timestamp from character to datetime format
df$timestamp <- as.POSIXct(df$timestamp, format = "%Y-%m-%d %H:%M:%S")
# Verify the conversion
cat("Timestamp class:", class(df$timestamp), "\n")## Timestamp class: POSIXct POSIXt
## Sample timestamps:
## [1] "2023-01-14 04:31:09 WAT" "2023-01-17 11:20:13 WAT"
## [3] "2023-01-22 02:17:46 WAT" "2023-01-24 08:18:23 WAT"
## [5] "2023-02-01 15:39:53 WAT" "2023-02-08 16:27:19 WAT"
## POSIXct[1:1000000], format: "2023-01-14 04:31:09" "2023-01-17 11:20:13" "2023-01-22 02:17:46" ...
Result: The timestamp column is now properly formatted as a POSIXct datetime object. This allows us to extract time-based features and analyze temporal patterns in fraud.
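As an aside, here is a minimal sketch of the kind of time-based features the POSIXct column makes possible (the derived names below are illustrative; the dataset already ships with hour and day_of_week columns):
# Derive hour-of-day and day-of-week from the timestamp (illustrative names)
hour_derived <- as.integer(format(df$timestamp, "%H"))
wday_derived <- as.integer(format(df$timestamp, "%u"))  # 1 = Monday ... 7 = Sunday
head(data.frame(timestamp = df$timestamp, hour_derived, wday_derived))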
Before building the model, it is important to understand the data's characteristics. In unsupervised learning, the main interest is identifying potential outliers and understanding the distribution of key features.
# Calculate fraud statistics
fraud_table <- table(df$is_fraud)
fraud_pct <- prop.table(fraud_table) * 100
cat("Fraud Distribution:\n")## Fraud Distribution:
##
## 0 1
## 997000 3000
##
## Percentages:
##
## 0 1
## 99.7 0.3
# Visualize
ggplot(df, aes(x = factor(is_fraud), fill = factor(is_fraud))) +
geom_bar() +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "Distribution of Transactions by Fraud Status",
x = "Transaction Type",
y = "Count",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "top")**Result: ** Fraud represents a small percentage of total transactions, which is typical in fraud detection scenarios. This is why unsupervised learning is valuable we can’t rely solely on labeled examples when fraud is so rare.
Before using the log transformation on the amount column, I want to see how the data is distributed.
What is a log transformation?
Log transformation is a mathematical technique used to compress data that spans a huge range (like transaction amounts) so that small values and massive values can be visualized together on the same scale.
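A quick numeric illustration (toy amounts, not drawn from the dataset) shows how the log compresses the range:
# Three toy amounts spanning four orders of magnitude
amounts <- c(500, 50000, 5000000)
log1p(amounts)  # natural log of (1 + amount) -> roughly 6.2, 10.8, 15.4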
# Visualize raw transaction amounts (No Log Scale)
ggplot(df, aes(x = amount)) +
geom_histogram(bins = 50, fill = "#E74C3C", alpha = 0.7) +
scale_x_continuous(labels = scales::comma) +
labs(
title = "Raw Distribution of Transaction Amounts",
subtitle = "Notice how 'skewed' the data looks without scaling",
x = "Amount (₦)",
y = "Count"
) +
  theme_minimal()

Interpretation: From the visualization, this plot is heavily right-skewed.
Almost all transactions are crammed into the single tall bar on the left (small amounts).
The "outliers" (million-naira transactions) are so spread out on the right that they are invisible.
Problem: We cannot see the pattern of normal user behavior here.
Solution: This is why we use the log scale in the next step (3.2) to "unpack" that tall bar and reveal the bell curve hidden inside.
# Visualize transaction amounts
ggplot(df, aes(x = amount)) +
geom_histogram(bins = 50, fill = "#3498DB", alpha = 0.7) +
scale_x_log10(labels = scales::comma) +
labs(
title = "Distribution of Transaction Amounts (Log Scale)",
x = "Amount (₦)",
y = "Count"
) +
  theme_minimal()
##
## Amount Statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.686e+02 2.800e+04 6.668e+04 1.570e+05 1.595e+05 1.793e+07
What this plot is saying: This histogram shows the Distribution of Transaction Amounts on a Log Scale.
Analysis: Transaction amounts vary widely, from small everyday purchases to large transfers. The log scale visualization reveals that most transactions cluster in the lower range, with outliers representing unusually large amounts. These extreme values are prime candidates for anomaly detection.
# Analyze fraud rate across different channels
channel_fraud <- df %>%
group_by(channel) %>%
summarise(
total_tx = n(),
fraud_tx = sum(is_fraud),
fraud_rate = fraud_tx/3000 * 100  # each channel's share of all 3,000 fraud cases
) %>%
arrange(desc(fraud_rate))
channel_fraud
## # A tibble: 6 × 4
## channel total_tx fraud_tx fraud_rate
## <chr> <int> <int> <dbl>
## 1 Mobile 449522 1496 49.9
## 2 Web 200488 687 22.9
## 3 POS 180035 551 18.4
## 4 IB 99653 168 5.6
## 5 ECOM 50227 76 2.53
## 6 ATM 20075 22 0.733
# Visualize only the Fraud Rate
ggplot(channel_fraud, aes(x = reorder(channel, fraud_rate), y = fraud_rate)) +
geom_bar(stat = "identity", fill = "#E74C3C", alpha = 0.8) +
geom_text(aes(label = sprintf("%.2f%%", fraud_rate)), hjust = -0.1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Add space for labels
labs(
title = "Fraud Rate by Transaction Channel",
subtitle = "Which channels are most vulnerable to attacks?",
x = "Channel",
y = "Fraud Rate (%)"
) +
theme_minimal() +
  coord_flip()

What this plot tells us: This chart ranks which banking platforms are most attractive to fraudsters. Mobile and Web dominate, together accounting for over 70% of all fraud cases (49.9% and 22.9% respectively).
# Extract hour for visualization
df_temp <- df %>%
mutate(hour = as.numeric(format(timestamp, "%H")))
# Calculate fraud rate by hour
hourly_fraud <- df_temp %>%
group_by(hour) %>%
summarise(
total = n(),
fraud_count = sum(is_fraud),
fraud_rate = mean(is_fraud) * 100
)
# Visualize fraud rate over the day
ggplot(hourly_fraud, aes(x = hour, y = fraud_rate)) +
geom_line(color = "#E74C3C", size = 1.2) +
geom_point(color = "#C0392B", size = 3) +
scale_x_continuous(breaks = 0:23) +
labs(
title = "Hourly Fraud Trends",
subtitle = "Are attacks more common at night?",
x = "Hour of Day (0-23)",
y = "Fraud Rate (%)"
) +
theme_minimal() +
  theme(panel.grid.minor = element_blank())

What this plot tells us: This chart tracks the "fraud schedule" over a 24-hour cycle.
One of the strongest indicators of fraud is “velocity” how fast transactions are occurring. Fraudsters often try to drain an account quickly before being blocked.
# Compare velocity scores
ggplot(df, aes(x = factor(is_fraud), y = velocity_score, fill = factor(is_fraud))) +
geom_boxplot(outlier.alpha = 0.3) +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
scale_y_log10() + # Use log scale because velocity varies widely
labs(
title = "Transaction Velocity Distribution",
subtitle = "Do fraudsters transact faster than normal users?",
x = "Status",
y = "Velocity Score (Log Scale)",
fill = "Status"
) +
  theme_minimal()

What this plot tells us:
- They look alike: Contrary to expectation, the median velocity (the line inside the box) for fraudsters is strikingly similar to that of legitimate users.
- Why? Sophisticated fraudsters know that speed kills their chances of success, so they intentionally slow down their attacks to mimic normal human behavior and bypass simple velocity rules.
- Conclusion: We cannot catch these advanced fraudsters just by measuring speed. This confirms why we need a more complex algorithm like Isolation Forest: to find the subtle combinations of anomalies (e.g., normal speed + odd time + huge amount) that simple rules miss.
Are certain banks targeted more frequently? This helps identify if specific institutions have security loopholes being exploited.
# Calculate fraud rate by bank (Top 10 banks by volume)
bank_stats <- df %>%
group_by(bank) %>%
summarise(
volume = n(),
fraud_count = sum(is_fraud),
fraud_rate = fraud_count / volume * 100  # per-bank fraud rate (fraud count as a share of that bank's volume)
) %>%
arrange(desc(volume)) %>%
slice_head(n = 10)  # keep the 10 highest-volume banks, as the title states
ggplot(bank_stats, aes(x = reorder(bank, fraud_rate), y = fraud_rate)) +
geom_bar(stat = "identity", fill = "#8E44AD", alpha = 0.8) +
coord_flip() +
labs(
title = "Fraud Rate by Bank (Top 10 by Volume)",
subtitle = "Which institutions are being targeted?",
x = "Bank",
y = "Fraud Rate (%)"
) +
theme_minimal()
Interpretation:
The chart reveals a uniform fraud rate of ~0.3% across all top banks, indicating that attackers are targeting the entire banking infrastructure equally rather than exploiting a specific institution’s weakness.
There are several “Risk Scores” in the data (Velocity, Merchant Risk, Composite Risk). This analysis checks if they are telling the same story.
What these risk scores actually mean:
Merchant Risk Score (the "shop" score): This measures how sketchy the recipient is. If you buy from a verified shop (like Jumia), this score is low. If you transfer money to a gambling site or a brand-new, unverified website, this score is high.
Composite Risk (the "overall" score): This is the bank's existing alarm system, a single number that tries to sum up everything (velocity + merchant risk + location + device ID). Think of it like a student's GPA: it combines Math, English, and Science into one grade.
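The exact formula behind composite_risk is not documented in the dataset, but conceptually it behaves like a weighted blend of the individual signals. A purely hypothetical sketch (the weights and the location_risk input are illustrative, not the data provider's actual formula):
# Hypothetical recombination of risk signals into one score (illustrative weights only)
composite_sketch <- function(velocity, merchant_risk, location_risk,
                             w = c(0.4, 0.4, 0.2)) {
  sum(w * c(velocity, merchant_risk, location_risk))
}
composite_sketch(velocity = 0.42, merchant_risk = 0.88, location_risk = 0.10)  # ~0.54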
# Select only the numerical risk scores
risk_data <- df %>%
select(velocity_score, merchant_risk_score, composite_risk, amount)
# Calculate the correlation matrix
cor_matrix <- cor(risk_data)
# Reshape for plotting (making it tidy)
cor_melted <- as.data.frame(as.table(cor_matrix))
# Plot Heatmap
ggplot(cor_melted, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
geom_text(aes(label = round(Freq, 2)), color = "white", size = 4) +
scale_fill_gradient2(
low = "#3498DB", high = "#E74C3C", mid = "white",
midpoint = 0, limit = c(-1, 1), name = "Correlation"
) +
labs(
title = "Correlation Heatmap of Risk Factors",
subtitle = "1.0 = Perfect Match, 0.0 = No Relationship",
x = "", y = ""
) +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key observation: The heatmap shows how strongly each risk factor moves with the others (1.0 = a perfect match, 0.0 = no relationship).
The next chart compares the bank's existing composite risk score (x-axis) against the transaction amount (y-axis).
# We take a sample to avoid overplotting
set.seed(123)
sample_df <- df %>% sample_n(10000)
ggplot(sample_df, aes(x = composite_risk, y = amount, color = factor(is_fraud))) +
geom_point(alpha = 0.6, size = 2) +
scale_y_log10(labels = scales::comma) + # Log scale for amount
scale_color_manual(
values = c("#BDC3C7", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "The Blind Spots: Amount vs. Risk Score",
subtitle = "Red dots on the LEFT are frauds that bypassed the current risk system",
x = "Composite Risk Score (Current System)",
y = "Transaction Amount (Log Scale)",
color = "Status"
) +
theme_minimal() +
geom_vline(xintercept = 0.5, linetype = "dashed", color = "black") +
annotate("text", x = 0.2, y = max(sample_df$amount), label = "Low Risk Zone", hjust = 0.5) +
annotate("text", x = 0.8, y = max(sample_df$amount), label = "High Risk Zone", hjust = 0.5)Insight: - The Dotted Line (Threshold): The vertical line at 0.5 represents the typical cutoff for flagging a transaction. Anything to the right is “Risky”, anything to the left is “Safe”. - The Critical Finding: Notice the significant cluster of Red Dots (Fraud) on the Left Side of the dotted line. - Interpretation: These are high-value fraudulent transactions (some exceeding ₦1,000,000) that have Low Risk Scores (0.0 - 0.2). - Conclusion: The current rule-based risk system is failing. It consistently misclassifies these high-value frauds as “Low Risk,” allowing them to pass through undetected. This failure demonstrates exactly why a more sophisticated model like Isolation Forest is required.
Understanding who is being targeted and how they are being attacked is crucial for tailored prevention.
# 1. Analyze Age Group Vulnerability
age_fraud <- df %>%
group_by(age_group) %>%
summarise(
total = n(),
fraud_count = sum(is_fraud),
fraud_rate = fraud_count/3000 * 100  # each age group's share of all 3,000 fraud cases
)
p1 <- ggplot(age_fraud, aes(x = age_group, y = fraud_rate, fill = age_group)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = sprintf("%.2f%%", fraud_rate)), vjust = -0.5) +
labs(
title = "Fraud Rate by Age Group",
subtitle = "Who is the primary target?",
x = "Age Group",
y = "Fraud Rate (%)"
) +
theme_minimal() +
theme(legend.position = "none")
# 2. Analyze Fraud Techniques
technique_fraud <- df %>%
filter(is_fraud == 1) %>%
count(fraud_technique) %>%
mutate(perc = n / sum(n) * 100)
p2 <- ggplot(technique_fraud, aes(x = reorder(fraud_technique, n), y = n, fill = fraud_technique)) +
geom_bar(stat = "identity", alpha = 0.8) +
geom_text(aes(label = n), hjust = 0.5) +
labs(
title = "Most Common Fraud Techniques",
subtitle = "How are they stealing the money?",
x = "Technique",
y = "Number of Incidents"
) +
theme_minimal() +
coord_flip() +
theme(legend.position = "none")
# Print plots
print(p1)
print(p2)

What these plots tell us (the human factor): The first chart shows which age groups account for the largest share of fraud, and the second shows which techniques fraudsters rely on most.
Why this step matters: The Isolation Forest algorithm requires numerical inputs to build its random partitions, so we need to transform the data into a purely numerical matrix, extracting meaningful features and encoding categorical variables.
# Prepare features for the model
# CRITICAL: We exclude 'is_fraud' because this is unsupervised learning
# The robot must learn to detect fraud without being told what fraud looks like
# We select the most relevant features for detection
features_to_use <- df %>%
select(
amount, hour, day_of_week, channel, bank,
velocity_score, merchant_risk_score, composite_risk,
tx_count_24h, amount_vs_mean_ratio
) %>%
mutate(across(where(is.character), as.factor))
# Convert to numerical matrix (one-hot encoding for categorical variables)
# This converts "Bank A" into a column of 1s and 0s
model_data <- model.matrix(~ . - 1, data = features_to_use)
head(model_data)
## amount hour day_of_week channelATM channelECOM channelIB channelMobile
## 1 32266.83 4 5 0 0 0 1
## 2 72530.49 11 1 0 0 0 0
## 3 168152.87 2 6 0 0 0 1
## 4 16439.93 8 1 0 0 0 1
## 5 9922.68 15 2 0 0 0 0
## 6 80685.56 16 2 0 0 0 0
## channelPOS channelWeb bankFCMB bankFidelity bankFirstBank bankGTBank
## 1 0 0 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 1 0 0 0 1 0
## 6 0 1 0 0 0 1
## bankSterling bankUBA bankUnion bankWema bankZenith velocity_score
## 1 1 0 0 0 0 0.18936948
## 2 0 1 0 0 0 0.42567123
## 3 0 0 0 1 0 0.98686550
## 4 0 0 0 0 0 0.09648363
## 5 0 0 0 0 0 0.05823481
## 6 0 0 0 0 0 0.47353218
## merchant_risk_score composite_risk tx_count_24h amount_vs_mean_ratio
## 1 0.2149999 0.07055978 1 0.18936948
## 2 0.8774244 0.27684880 1 0.42567123
## 3 0.4402304 0.16364883 1 0.98686550
## 4 0.8774244 0.26631480 1 0.09648363
## 5 0.2312907 0.07125073 1 0.05823481
## 6 0.6084928 0.19770088 1 0.47353218
## Feature matrix dimensions: 1000000 23
## Features used: 23
Result & Rationale (Why These Features?):
We transformed the raw dataset into a numerical feature matrix, carefully selecting 10 high-signal features based on the EDA findings:
- velocity_score & tx_count_24h: capture speed (fraudsters often move fast).
- amount_vs_mean_ratio: captures deviation (is this transaction 10x your normal spend?).
- merchant_risk_score & composite_risk: the "smoke detectors".
- hour & day_of_week: capture timing risk (the "witching hour" from Section 3.4).
- channel: captures vulnerability (Section 3.3 showed Mobile/Web are high risk).
- amount: the core financial value.
Note: We excluded identifying columns like transaction_id and customer_id because unique identifiers do not contain predictive patterns.
Why this step matters: This is the core of this project. Unlike supervised models that memorize “Fraud vs. Safe” examples, this unsupervised model learns the geometry of normal behavior. It builds a mathematical boundary around the “normal” data points.
How the Training Works:
1. Random partitioning: The algorithm builds 100 different decision trees (ntrees = 100).
2. Slicing the data: Each tree randomly selects a feature (e.g., amount) and a split value to slice the data.
3. The "isolation" principle:
- Normal transactions are clustered together; it takes many cuts to isolate a single normal point from the herd.
- Anomalies (fraud) are distinct and "far away" in the feature space; it takes very few cuts to isolate them.
- Therefore, path length (the number of cuts) becomes our proxy for the anomaly score.
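To make the path-length intuition concrete, here is a tiny self-contained sketch on toy data (not the NIBSS features): an obvious outlier is isolated quickly and therefore receives a clearly higher isotree score than points inside the main cluster.
library(isotree)
set.seed(42)
# 500 normal points plus one far-away outlier
toy <- data.frame(x = c(rnorm(500), 10), y = c(rnorm(500), 10))
toy_model <- isolation.forest(toy, ntrees = 50)
toy_scores <- predict(toy_model, toy)
toy_scores[501]          # the outlier: few cuts needed -> score close to 1
mean(toy_scores[1:500])  # the cluster: many cuts needed -> scores well below the outlier's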
# Train the Isolation Forest model
# ntrees = 100: We build an ensemble of 100 trees to ensure robustness (like a jury of 100 voters)
# sample_size = 256: The optimal subset size to detect anomalies without overfitting
cat("Training the Anomaly Detection Model...\n")## Training the Anomaly Detection Model...
iso_model <- isolation.forest(model_data, ntrees = 100, sample_size = 256)
# Save the trained model
saveRDS(iso_model, "anomaly_detection_model.rds")
cat("✓ Model trained and saved as 'anomaly_detection_model.rds'\n")## ✓ Model trained and saved as 'anomaly_detection_model.rds'
Result: The anomaly detection model has been successfully trained on the feature matrix derived from 1,000,000 transactions. It has learned the statistical structure of "normalcy" and can now score any new transaction based on how much it deviates from this learned structure. The model is saved as an .rds file for deployment in our Shiny application.
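As a deployment sketch (illustrative, not the production Shiny code), the saved .rds file can be loaded once at app startup and fresh transactions scored with the same preparation pipeline used for training; "new_transactions.csv" is a hypothetical input file, and the dummy columns produced by model.matrix must line up with the 23 training columns.
# Illustrative scoring pipeline for new data (e.g., inside the Shiny app's server function)
library(dplyr)
library(isotree)

iso_model <- readRDS("anomaly_detection_model.rds")
new_df <- read.csv("new_transactions.csv")  # hypothetical file with the same columns as training

new_features <- new_df %>%
  select(amount, hour, day_of_week, channel, bank,
         velocity_score, merchant_risk_score, composite_risk,
         tx_count_24h, amount_vs_mean_ratio) %>%
  mutate(across(where(is.character), as.factor))

# Note: factor levels (and hence dummy columns) must match those seen during training
new_matrix <- model.matrix(~ . - 1, data = new_features)
new_df$anomaly_score <- predict(iso_model, new_matrix)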
Why this step matters: The Model now examines every transaction and assigns an anomaly score. This score represents how “unusual” each transaction appears compared to the learned normal behavior.
# 1. SCORING: The Model examines every transaction
# predict() takes the trained 'iso_model' and the feature matrix 'model_data'
# It assigns a score (0.0 to 1.0) to each of the 1,000,000 rows
scores <- predict(iso_model, model_data)
# 2. ASSIGNMENT: We save these scores back into our main dataframe
df$anomaly_score <- scores
# 3. VISUALIZATION: Checking the "Shape of Risk"
# We plot a histogram to see where most transactions land.
# Expectation: A tall peak around 0.45 (Normal) and a long tail to the right (Anomalies).
ggplot(df, aes(x = anomaly_score)) +
geom_histogram(bins = 50, fill = "#E74C3C", alpha = 0.7) +
labs(
title = "Distribution of Anomaly Scores",
subtitle = "Higher scores indicate higher likelihood of fraud",
x = "Anomaly Score",
y = "Count"
) +
  theme_minimal()

# 4. STATISTICS: Establishing the Baseline
# Min/Mean/Max help us define what is "High Risk."
cat("\nAnomaly Score Statistics:\n")##
## Anomaly Score Statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3502 0.4182 0.4400 0.4432 0.4634 0.6926
Interpretation of Results (What the Plot Shows):
- The "normal" hill: There is a massive bell curve centered around 0.40 - 0.45. This represents the 99%+ of customers who are legitimate.
- The "danger" tail: To the far right (scores > 0.60), the bars get very short. These are the rare, isolated anomalies.
- The strategy: Our job is to draw a line (a threshold) to separate the hill from the tail. Anything to the right of that line gets blocked.
What the scores mean:
- Scores around 0.4 - 0.5: typical, normal transactions (this is where the bulk of the data sits).
- Scores approaching 1.0: highly anomalous (potential fraud).
The histogram shows most transactions cluster at lower scores (normal behavior), with a tail extending toward higher scores (potential anomalies).
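One practical way to act on these scores (an illustrative convention, not something prescribed by isotree) is to bucket them into risk bands before they reach an investigator; the 0.50 and 0.60 cutoffs below echo the "hill vs. tail" reading of the histogram above.
# Illustrative risk bands derived from the empirical score distribution
df$risk_band <- cut(
  df$anomaly_score,
  breaks = c(-Inf, 0.50, 0.60, Inf),
  labels = c("Low", "Review", "Block")
)
table(df$risk_band)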
Why this step matters: In a real-world unsupervised scenario, there won't be labels to check against. But for this research we do have them, which lets us "grade" the anomaly detection model. We want to verify that the model assigns significantly higher scores to actual fraud than to legitimate transactions.
# 1. GROUPING: Aggregating data by Fraud Status (0 or 1)
# We want to see the difference in 'Average Score' between the two groups.
avg_scores <- df %>%
group_by(is_fraud) %>%
summarise(
mean_score = mean(anomaly_score),
median_score = median(anomaly_score),
sd_score = sd(anomaly_score),
count = n()
)
cat("Average Anomaly Scores by Fraud Status:\n")## Average Anomaly Scores by Fraud Status:
## # A tibble: 2 × 5
## is_fraud mean_score median_score sd_score count
## <int> <dbl> <dbl> <dbl> <int>
## 1 0 0.443 0.440 0.0388 997000
## 2 1 0.446 0.444 0.0372 3000
# 2. VISUALIZATION: Boxplot Comparison
# Boxplots are perfect here because they show the 'Range' of scores.
# If the model is working, the Red Box (Fraud) should be visibly higher up than the Green Box (Legit).
ggplot(df, aes(x = factor(is_fraud), y = anomaly_score, fill = factor(is_fraud))) +
geom_boxplot() +
scale_fill_manual(
values = c("#2ECC71", "#E74C3C"),
labels = c("Legitimate", "Fraudulent")
) +
labs(
title = "Anomaly Scores: Legitimate vs Fraudulent Transactions",
subtitle = "Do fraudsters get higher scores than regular users?",
x = "Transaction Type (0 = Legitimate, 1 = Fraudulent)",
y = "Anomaly Score",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "top")Visual Analysis of the Boxplot (The Truth):
# 1. SETTING THE THRESHOLD
# We decide that the "Top 5%" most suspicious transactions are worth blocking.
# quantile(0.95) finds the score that separates the bottom 95% from the top 5%.
threshold <- quantile(df$anomaly_score, 0.95)
# 2. MAKING PREDICTIONS
# If Score > Threshold -> Predict Fraud (1)
# Else -> Predict Normal (0)
df$predicted_fraud <- ifelse(df$anomaly_score > threshold, 1, 0)
# 3. CONFUSION MATRIX
# This table compares our "Predictions" vs the "Truth".
# It tells us:
# - How many frauds we caught (True Positives)
# - How many innocent people we blocked (False Positives)
conf_matrix <- table(Predicted = df$predicted_fraud, Actual = df$is_fraud)
cat("\nConfusion Matrix:\n")##
## Confusion Matrix:
## Actual
## Predicted 0 1
## 0 947169 2831
## 1 49831 169
# 4. CALCULATING SCORES
# We break down the matrix into 4 key numbers:
TP <- conf_matrix[2, 2] # True Positives (Hit)
TN <- conf_matrix[1, 1] # True Negatives (Correctly Ignored)
FP <- conf_matrix[2, 1] # False Positives (False Alarm)
FN <- conf_matrix[1, 2] # False Negatives (Miss)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("\nPerformance Metrics:\n")##
## Performance Metrics:
## Precision: 0.0034
## Recall: 0.0563
## F1-Score: 0.0064
Interpretation:
- 169 real frauds caught: The model successfully caught 169 fraudulent transactions (bottom right of the matrix). These are attacks that were stopped.
- 2,831 frauds escaped: The model missed 2,831 frauds (top right). It predicted "safe" (0), but they were actually fraud (1). Why? These frauds looked too normal for the anomaly detector to notice.
- 49,831 innocent customers blocked: The model flagged 49,831 legitimate transactions as fraud (bottom left).
Conclusion: The model is far from perfect. It catches some fraud (169 cases), but at the cost of inconveniencing tens of thousands of customers (about 50,000 false alarms). This proves that the Isolation Forest model alone is not enough to catch these fraudsters.
What these Scorecard Numbers Mean:
These scores are extremely low.
Precision (0.0034): This means 0.34%. In simple English: For every 1,000 times the model yells “FRAUD!”, it is wrong 997 times. It is almost purely guessing.
Recall (0.0563): This means 5.6%. In simple English: Out of all the actual criminals, the model only caught 5% of them. It let 95% get away.
F1-Score (0.0064): This is the overall grade: 0.6%. Basically, the model failed.
Why did it fail so badly? It is not because the algorithm is broken; it is a property of the data. Fraudulent transactions in this dataset are NOT anomalies: they look almost exactly like normal transactions (similar amounts, similar speed, similar timing). So an anomaly detector, which hunts for weirdness, is not the right tool on its own. It is like trying to find a specific grain of sand on a beach by looking for "the weirdest grain": they all look like sand.
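A tiny simulation (toy data, not the NIBSS records) illustrates the point: when "fraud" rows are drawn from the same distribution as legitimate ones, Isolation Forest assigns them essentially the same scores.
library(isotree)
set.seed(1)
legit <- data.frame(amount = rlnorm(5000, meanlog = 10, sdlog = 1))
mimic <- data.frame(amount = rlnorm(50, meanlog = 10, sdlog = 1))  # "fraud" copying normal behaviour
m <- isolation.forest(legit, ntrees = 50)
mean(predict(m, legit))  # baseline score for normal transactions
mean(predict(m, mimic))  # nearly identical: mimicking fraud is not "isolated"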
Conclusion
The most critical finding of this study is that modern digital fraud does not always look like an anomaly. The Isolation Forest analysis showed that fraudulent transactions are statistically nearly identical to legitimate ones (similar amounts, velocity, and timing).
The Unsupervised Anomaly Detection approach failed to accurately distinguish fraud, achieving only 0.34% Precision and 5.6% Recall. This scientifically demonstrates that relying solely on “deviation from normal” is insufficient for detecting sophisticated banking fraud.
The analysis of attack vectors points to social engineering: fraudsters aren't breaking the system (which would create anomalies); they are hacking the human into authorizing "normal"-looking transfers.
Recommendations
Customer Education & Social‑Engineering Mitigation Since the analysis shows social engineering as the dominant technique, we can launch a phishing‑awareness campaign:
Periodic SMS/email reminders about never sharing OTPs.
In‑app pop‑ups during high‑risk hours warning users of “possible phishing attempts.”
Measure impact by tracking reductions in the “social‑engineering” fraud count after each campaign.
We can still keep the unsupervised model: retain it as a "zero-day" detector for the top 1% of truly unusual anomalies (which supervised models might miss), as sketched below.
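A minimal sketch of that hybrid rule, keeping the unsupervised score only for the most extreme cases (the 1% cutoff is the figure suggested above):
# Flag only the top 1% of anomaly scores as "zero-day" candidates for manual review
zero_day_cutoff <- quantile(df$anomaly_score, 0.99)
df$zero_day_flag <- df$anomaly_score >= zero_day_cutoff
sum(df$zero_day_flag)  # ~10,000 of 1,000,000 transactions routed to investigators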
Future Direction
Deep Learning (Autoencoders/LSTMs): Investigate Sequence Models (LSTMs) that can analyze the sequence of user actions (e.g., “User checked balance 5 times -> changed PIN -> transferred ALL money”). This captures intent better than single-transaction analysis.
Graph Neural Networks (GNNs): Implement Network Analysis to detect Money Mule Rings. Instead of looking at one transaction, look at the web of connections. If Account A sends to Account B, and B sends to C immediately, that is a pattern Isolation Forest can’t see, but Graph Analysis can.
Real-Time Behavioral Biometrics: Integrate data on how the user interacts (typing speed, mouse movements) during the transaction. A fraudster manipulating an account behaves differently than the owner, even if the transaction numbers are identical.