This is a synthetic dataset of financial transactions, each labeled as fraudulent or not, containing 21 columns. Source: https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
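# dplyr provides %>% and select(); ggplot2 is used for the plots further down
library(dplyr)
library(ggplot2)
fraud_detection_transactions <- read.csv("synthetic_fraud_dataset.csv")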
# select only the 6 relevant variables
fraud_data <- fraud_detection_transactions %>%
  select(Transaction_Amount, Transaction_Type, Account_Balance,
         Risk_Score, Fraud_Label, Daily_Transaction_Count)
head(fraud_data)
## Transaction_Amount Transaction_Type Account_Balance Risk_Score Fraud_Label
## 1 39.79 POS 93213.17 0.8494 0
## 2 1.19 Bank Transfer 75725.25 0.0959 1
## 3 28.96 Online 1588.96 0.8400 1
## 4 254.32 ATM Withdrawal 76807.20 0.7935 1
## 5 31.28 POS 92354.66 0.3819 1
## 6 168.55 Online 33236.94 0.0504 0
## Daily_Transaction_Count
## 1 7
## 2 13
## 3 14
## 4 8
## 5 14
## 6 3
str(fraud_data)
## 'data.frame': 50000 obs. of 6 variables:
## $ Transaction_Amount : num 39.79 1.19 28.96 254.32 31.28 ...
## $ Transaction_Type : chr "POS" "Bank Transfer" "Online" "ATM Withdrawal" ...
## $ Account_Balance : num 93213 75725 1589 76807 92355 ...
## $ Risk_Score : num 0.8494 0.0959 0.84 0.7935 0.3819 ...
## $ Fraud_Label : int 0 1 1 1 1 0 0 1 0 0 ...
## $ Daily_Transaction_Count: int 7 13 14 8 14 3 2 3 7 6 ...
The selected data contains 6 variables and 50,000 observations. The variables are:
Transaction_Amount (Numeric; Ratio): The amount spent in a single transaction. Unit of measurement: Dollars.
Transaction_Type (Categorical; Nominal): The method of transaction (POS, Online, ATM Withdrawal, Bank Transfer).
Account_Balance (Numeric; Ratio): The user’s bank balance before the transaction. Unit of measurement: Dollars.
Risk_Score (Numeric; Interval): A fraud probability score between 0 and 1 (higher = riskier).
Fraud_Label (Categorical; Nominal): Indicates whether the transaction was fraudulent (1) or not (0).
Daily_Transaction_Count (Numeric; Ratio): The total number of transactions a user made on that day.
The unit of observation is the transaction: each row corresponds to a single transaction.
print("Missing Values per Variable")
## [1] "Missing Values per Variable"
print(colSums(is.na(fraud_detection_transactions)))
## Transaction_ID User_ID
## 0 0
## Transaction_Amount Transaction_Type
## 0 0
## Timestamp Account_Balance
## 0 0
## Device_Type Location
## 0 0
## Merchant_Category IP_Address_Flag
## 0 0
## Previous_Fraudulent_Activity Daily_Transaction_Count
## 0 0
## Avg_Transaction_Amount_7d Failed_Transaction_Count_7d
## 0 0
## Card_Type Card_Age
## 0 0
## Transaction_Distance Authentication_Method
## 0 0
## Risk_Score Is_Weekend
## 0 0
## Fraud_Label
## 0
The printed summary shows no missing values (NA) in any of the 21 columns.
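`is.na()` only catches values that R already treats as missing. In character columns, blanks or placeholder strings can slip through. A minimal sketch of a broader check on a small made-up data frame (the `toy` data and the placeholder list are assumptions for illustration, not part of the dataset):

```r
# Toy data frame standing in for the real data (hypothetical values)
toy <- data.frame(
  Transaction_Type = c("POS", "", "Online", "N/A"),
  Risk_Score = c(0.85, 0.10, NA, 0.79),
  stringsAsFactors = FALSE
)

# Placeholder strings that should count as missing (an assumption; adjust as needed)
placeholders <- c("", "NA", "N/A", "null")

# Count NA values, plus placeholder strings in character columns
count_missing <- function(df) {
  sapply(df, function(col) {
    if (is.character(col)) {
      sum(is.na(col) | trimws(col) %in% placeholders)
    } else {
      sum(is.na(col))
    }
  })
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

On the real data the same function could be called as `count_missing(fraud_detection_transactions)`.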
fraud_data$Transaction_Type <- factor(fraud_data$Transaction_Type)
fraud_data$Fraud_Label <- as.factor(fraud_data$Fraud_Label)
levels(fraud_data$Transaction_Type)
## [1] "ATM Withdrawal" "Bank Transfer" "Online" "POS"
summary(fraud_data)
## Transaction_Amount Transaction_Type Account_Balance Risk_Score
## Min. : 0.00 ATM Withdrawal:12453 Min. : 500.5 Min. :0.0001
## 1st Qu.: 28.68 Bank Transfer :12452 1st Qu.:25356.0 1st Qu.:0.2540
## Median : 69.66 Online :12546 Median :50384.4 Median :0.5022
## Mean : 99.41 POS :12549 Mean :50294.1 Mean :0.5016
## 3rd Qu.: 138.85 3rd Qu.:75115.1 3rd Qu.:0.7495
## Max. :1174.14 Max. :99998.3 Max. :1.0000
## Fraud_Label Daily_Transaction_Count
## 0:33933 Min. : 1.000
## 1:16067 1st Qu.: 4.000
## Median : 7.000
## Mean : 7.485
## 3rd Qu.:11.000
## Max. :14.000
sd(fraud_data$Transaction_Amount)
## [1] 98.68729
sd(fraud_data$Risk_Score)
## [1] 0.2877741
Transaction Amount:
Mean: $99.41 (well above the median, which suggests right skew)
Median: $69.66 (more representative of a typical transaction)
Mode: not computed; the histogram suggests it lies in the $0-50 range
Standard deviation: $98.69 (nearly equal to the mean, indicating high variability)
Range: $0 to $1,174.14 (a very wide range with a long upper tail)
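A standard deviation nearly equal to the mean is what an exponential distribution would produce. As a rough plausibility check (an assumption for illustration, not something the dataset documentation states), the quartiles implied by an exponential distribution with the observed mean can be compared with the observed quartiles:

```r
# Summary values copied from the summary() output above
observed_mean <- 99.41
observed_quartiles <- c(Q1 = 28.68, Median = 69.66, Q3 = 138.85)

# Quartiles of an exponential distribution with the same mean
# (hypothetical model, used only as a reference shape)
implied_quartiles <- qexp(c(0.25, 0.50, 0.75), rate = 1 / observed_mean)
names(implied_quartiles) <- names(observed_quartiles)

comparison <- rbind(observed = observed_quartiles,
                    exponential = round(implied_quartiles, 2))
print(comparison)
```

The implied quartiles land within about 2% of the observed ones, so an exponential-like shape is a reasonable description of the amounts, though this does not show how the synthetic data was actually generated.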
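Risk Score:
Mean: 0.5016 (almost exactly 0.5, the midpoint of the scale)
Median: 0.5022 (very close to the mean, so the distribution is roughly symmetric)
Quartiles: 0.2540 (Q1) and 0.7495 (Q3)
Standard deviation: 0.2878, indicating moderate variability around the mean. One standard deviation on either side of the mean spans 0.214 to 0.789. The 68% rule applies only to normally distributed data; the quartiles and standard deviation here look closer to a uniform distribution on [0, 1] (whose standard deviation is about 0.289), for which that interval would hold roughly 58% of transactions.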
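How many observations fall within one standard deviation of the mean depends on the shape of the distribution. A minimal sketch with simulated values (the uniform and normal samples are stand-ins, not the real `Risk_Score` column) shows how the share can be computed directly rather than assumed:

```r
set.seed(42)  # make the simulation reproducible

# Share of values within one standard deviation of the mean
share_within_one_sd <- function(x) {
  mean(abs(x - mean(x)) <= sd(x))
}

uniform_scores <- runif(50000, min = 0, max = 1)      # flat, like a uniform score
normal_values  <- rnorm(50000, mean = 0.5, sd = 0.1)  # bell-shaped reference

uniform_share <- share_within_one_sd(uniform_scores)
normal_share  <- share_within_one_sd(normal_values)

print(c(uniform = uniform_share, normal = normal_share))
```

On the real data the same function could be applied as `share_within_one_sd(fraud_data$Risk_Score)`.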
Fraud Label Distribution:
Non-fraud (0): 33,933 transactions (67.87%)
Fraud (1): 16,067 transactions (32.13%)
fraud_stats <- fraud_data %>%
  group_by(Fraud_Label) %>%
  summarize(
    Mean_Amount = mean(Transaction_Amount),
    Median_Amount = median(Transaction_Amount),
    SD_Amount = sd(Transaction_Amount),
    Min_Amount = min(Transaction_Amount),
    Max_Amount = max(Transaction_Amount),
    Mean_Risk = mean(Risk_Score),
    Median_Risk = median(Risk_Score),
    SD_Risk = sd(Risk_Score),
    Min_Risk = min(Risk_Score),
    Max_Risk = max(Risk_Score),
    Count = n()
  )
print(fraud_stats)
## # A tibble: 2 × 12
## Fraud_Label Mean_Amount Median_Amount SD_Amount Min_Amount Max_Amount
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 99.3 69.4 98.5 0.01 1174.
## 2 1 99.7 70.1 99.2 0 1005.
## # ℹ 6 more variables: Mean_Risk <dbl>, Median_Risk <dbl>, SD_Risk <dbl>,
## # Min_Risk <dbl>, Max_Risk <dbl>, Count <int>
Non-Fraudulent Transactions Amount (Fraud_Label = 0):
Mean: $99.28
Median: $69.41
Standard Deviation: $98.46
Range: $0.01 to $1,174.14
Fraudulent Transactions Amount (Fraud_Label = 1):
Mean: $99.68
Median: $70.1 (as printed in the summary table)
Standard Deviation: $99.16
Range: $0.00 to $1,005.32
Risk Score by Fraud Classification:
Non-Fraudulent Transactions: Mean Risk Score: 0.425, Median Risk Score: 0.426
Fraudulent Transactions: Mean Risk Score: 0.663, Median Risk Score: 0.806
The mean risk score for fraudulent transactions (0.663) is substantially higher than for non-fraudulent transactions (0.425). The difference of approximately 0.238 is about 0.83 of the overall standard deviation of the risk score (0.288).
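The comparison above amounts to dividing the gap between the group means by the overall standard deviation. A short worked version using the figures reported in this section (using the overall standard deviation rather than a pooled within-group one is a simplification):

```r
# Values reported in the text and output above
mean_risk_fraud <- 0.663
mean_risk_legit <- 0.425
sd_risk_overall <- 0.2877741

# Raw gap between the group means
mean_gap <- mean_risk_fraud - mean_risk_legit

# Gap expressed in units of the overall standard deviation
standardized_gap <- mean_gap / sd_risk_overall

print(round(c(gap = mean_gap, standardized = standardized_gap), 3))
```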
ggplot(fraud_data, aes(x = Transaction_Amount, fill = Fraud_Label)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  labs(title = "Distribution of Transaction Amounts",
       x = "Transaction Amount ($)",
       fill = "Fraudulent?") +
  scale_fill_manual(values = c("cornflowerblue", "deeppink"))
The histogram shows how often transactions of each size occur, split by fraud status. Both the fraudulent and non-fraudulent distributions are strongly right-skewed: most transactions fall below $200, with a long tail extending toward higher values. Non-fraudulent transactions (blue) outnumber fraudulent ones (magenta), consistent with the dataset’s class imbalance (33,933 non-fraud vs. 16,067 fraud cases). Amounts range from near $0 to approximately $1,200, and counts fall off rapidly beyond $400. The two distributions have very similar shapes, in line with the nearly identical group means and standard deviations reported above. There appears to be a slightly higher share of fraudulent transactions in the $50-150 range relative to their overall frequency, which could mean that fraudsters target moderate amounts that are less likely to trigger scrutiny, but the difference is small and hard to judge from overlaid counts.
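Because the two classes differ in size, overlaid counts make it hard to compare how each class is spread across amounts. One option is to look at within-class shares per amount bin. A minimal sketch on simulated data (the `amount` and `label` vectors and the bin edges are made up for illustration):

```r
set.seed(1)  # reproducible toy data

# Simulated stand-ins for Transaction_Amount and Fraud_Label
amount <- rexp(5000, rate = 1 / 100)
label  <- factor(rbinom(5000, size = 1, prob = 0.32), levels = c(0, 1))

# Bin the amounts (hypothetical bin edges)
bins <- cut(amount, breaks = c(0, 50, 150, 400, Inf), include.lowest = TRUE)

# Column-wise proportions: each class sums to 1 across the bins
share_by_class <- prop.table(table(bins, label), margin = 2)
print(round(share_by_class, 3))
```

In ggplot2, a comparable view can be obtained by mapping `y = after_stat(density)` in `geom_histogram()`, so that each class is scaled to its own total.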
ggplot(fraud_data, aes(x = Fraud_Label, y = Risk_Score, fill = Fraud_Label)) +
  geom_boxplot() +
  labs(title = "Risk Score Distribution by Fraud Label",
       x = "Fraudulent?",
       y = "Risk Score",
       fill = "Fraudulent?") +
  scale_x_discrete(labels = c("0" = "Not Fraud", "1" = "Fraud")) +
  scale_fill_manual(values = c("palegreen", "plum1"))
The box plot illustrates the risk score distribution, separated by the transaction’s fraud classification. The median risk score for fraudulent transactions (approximately 0.80) is substantially higher than for non-fraudulent transactions (approximately 0.43), which suggests the risk score has useful discriminative power. The fraud category shows an IQR from approximately 0.40 to 0.92, while the non-fraud IQR runs from 0.22 to 0.63, so the two groups overlap considerably but have distinct central tendencies. Both categories span nearly the entire possible range (0-1) of risk scores, meaning that some fraudulent transactions receive low scores and some legitimate ones receive high scores. The fraud category has a somewhat wider spread of risk scores, suggesting more heterogeneity in fraudulent behavior patterns. The separation between the groups is visually clear, but a box plot alone does not establish statistical significance; a formal test would be needed to support that claim.
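One way to quantify the separation seen in a box plot is a Wilcoxon rank-sum test, whose statistic also yields the probability that a randomly chosen fraudulent transaction scores higher than a randomly chosen legitimate one (the AUC). A minimal sketch on simulated scores (the beta and uniform samples are assumptions standing in for the real risk scores):

```r
set.seed(7)  # reproducible toy data

# Simulated risk scores: fraud skewed toward high values, legitimate roughly flat
fraud_scores <- rbeta(500, shape1 = 3, shape2 = 1.5)
legit_scores <- runif(1000, min = 0, max = 1)

# Rank-sum test of whether the two groups differ in location
test_result <- wilcox.test(fraud_scores, legit_scores)

# W counts the (fraud, legit) pairs where the fraud score is higher,
# so dividing by the number of pairs gives the AUC
auc <- unname(test_result$statistic) /
  (length(fraud_scores) * length(legit_scores))

print(test_result$p.value)
print(auc)
```

On the real data the test itself could be run as `wilcox.test(Risk_Score ~ Fraud_Label, data = fraud_data)`. With 50,000 rows even small differences will produce tiny p-values, so the AUC is the more informative number.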
ggplot(fraud_data, aes(x = Transaction_Amount, y = Risk_Score,
                       color = as.factor(Fraud_Label))) +
  geom_point(alpha = 0.6) +
  labs(title = "Transaction Amount vs. Risk Score",
       x = "Transaction Amount ($)",
       y = "Risk Score",
       color = "Fraudulent?") +
  scale_color_manual(values = c("palegreen3", "gold2"))
The scatter plot shows the relationship between transaction amount and risk score, with fraudulent transactions highlighted in gold. Most transactions cluster in the lower amount range ($0-400) across the full spectrum of risk scores. Fraudulent transactions (gold) predominantly occupy the upper risk score region (>0.75), particularly at lower transaction amounts. The spread of risk scores appears to narrow as transaction amounts increase, although this may partly reflect how few large transactions there are. Higher-value transactions (>$600) are predominantly non-fraudulent despite occasional high risk scores, which suggests that transaction size alone is not predictive of fraud. The fraudulent transactions appear to follow a bimodal pattern, with concentrations at very high risk scores (near 1.0) and at more moderate scores (0.4-0.7), which may point to subtypes of fraud with different risk signatures.
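Whether the spread of risk scores really changes with transaction amount can be checked numerically by computing the standard deviation and the number of points within amount bins. A minimal sketch on simulated data (the vectors and bin edges are hypothetical):

```r
set.seed(3)  # reproducible toy data

# Simulated stand-ins for Transaction_Amount and Risk_Score
amount <- rexp(5000, rate = 1 / 100)
risk   <- runif(5000, min = 0, max = 1)

# Group amounts into bins (hypothetical edges)
amount_bin <- cut(amount, breaks = c(0, 100, 200, 400, Inf),
                  include.lowest = TRUE)

# Number of points and spread of risk scores in each bin
spread_by_bin <- data.frame(
  n  = as.vector(table(amount_bin)),
  sd = as.vector(tapply(risk, amount_bin, sd)),
  row.names = levels(amount_bin)
)
print(spread_by_bin)
```

Bins with few points give noisy estimates, so the counts should be read alongside the standard deviations before concluding that the spread differs.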