Homework Assignment 1

Fraud Detection Transactions Dataset

This is a synthetic dataset of financial transactions, each labelled as fraudulent or not, containing 21 columns. Source: https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset

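library(dplyr)    # provides %>% and select()
library(ggplot2)  # used for the plots in the Visualisation section

fraud_detection_transactions <- read.csv("synthetic_fraud_dataset.csv")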

# select only the 6 relevant variables
fraud_data <- fraud_detection_transactions %>%
  select(Transaction_Amount, Transaction_Type, Account_Balance, 
         Risk_Score, Fraud_Label, Daily_Transaction_Count)

head(fraud_data)

##   Transaction_Amount Transaction_Type Account_Balance Risk_Score Fraud_Label
## 1              39.79              POS        93213.17     0.8494           0
## 2               1.19    Bank Transfer        75725.25     0.0959           1
## 3              28.96           Online         1588.96     0.8400           1
## 4             254.32   ATM Withdrawal        76807.20     0.7935           1
## 5              31.28              POS        92354.66     0.3819           1
## 6             168.55           Online        33236.94     0.0504           0
##   Daily_Transaction_Count
## 1                       7
## 2                      13
## 3                      14
## 4                       8
## 5                      14
## 6                       3

str(fraud_data)

## 'data.frame':    50000 obs. of  6 variables:
##  $ Transaction_Amount     : num  39.79 1.19 28.96 254.32 31.28 ...
##  $ Transaction_Type       : chr  "POS" "Bank Transfer" "Online" "ATM Withdrawal" ...
##  $ Account_Balance        : num  93213 75725 1589 76807 92355 ...
##  $ Risk_Score             : num  0.8494 0.0959 0.84 0.7935 0.3819 ...
##  $ Fraud_Label            : int  0 1 1 1 1 0 0 1 0 0 ...
##  $ Daily_Transaction_Count: int  7 13 14 8 14 3 2 3 7 6 ...

The selected data has 6 variables and 50,000 observations:

- Transaction_Amount (Numeric; Ratio): The amount spent in a single transaction. Unit of measurement: Dollars.
- Transaction_Type (Categorical; Nominal): The method of transaction (POS, Online, ATM Withdrawal, Bank Transfer).
- Account_Balance (Numeric; Ratio): The user’s bank balance before the transaction. Unit of measurement: Dollars.
- Risk_Score (Numeric; Interval): A fraud probability score between 0 and 1 (higher = riskier).
- Fraud_Label (Categorical; Nominal): Indicates whether the transaction was fraudulent (1) or not (0).
- Daily_Transaction_Count (Numeric; Ratio): The total number of transactions a user made on that day.

The unit of observation is the transaction, as each row corresponds to a single transaction.

Check for missing values

print("Missing Values per Variable")

## [1] "Missing Values per Variable"

print(colSums(is.na(fraud_detection_transactions)))

##               Transaction_ID                      User_ID 
##                            0                            0 
##           Transaction_Amount             Transaction_Type 
##                            0                            0 
##                    Timestamp              Account_Balance 
##                            0                            0 
##                  Device_Type                     Location 
##                            0                            0 
##            Merchant_Category              IP_Address_Flag 
##                            0                            0 
## Previous_Fraudulent_Activity      Daily_Transaction_Count 
##                            0                            0 
##    Avg_Transaction_Amount_7d  Failed_Transaction_Count_7d 
##                            0                            0 
##                    Card_Type                     Card_Age 
##                            0                            0 
##         Transaction_Distance        Authentication_Method 
##                            0                            0 
##                   Risk_Score                   Is_Weekend 
##                            0                            0 
##                  Fraud_Label 
##                            0

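The printed summary shows zero missing (NA) values in all 21 columns of the full dataset, so none of the 6 selected variables has missing values either.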
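One caveat with this check: `read.csv()` turns blank numeric fields into `NA`, but blank fields in text columns are read as empty strings, which `is.na()` does not count. A minimal sketch of a stricter check follows; the helper name `count_missing_or_blank` and the `toy` data frame are made up for illustration and are not part of the original analysis.

```r
# Count NA values plus empty or whitespace-only strings in each column.
count_missing_or_blank <- function(df) {
  sapply(df, function(col) {
    blank <- if (is.character(col)) {
      !is.na(col) & trimws(col) == ""
    } else {
      rep(FALSE, length(col))
    }
    sum(is.na(col) | blank)
  })
}

# Toy data standing in for the real file (assumption for illustration).
toy <- data.frame(
  amount = c(1, NA, 3),
  type = c("POS", "", " "),
  stringsAsFactors = FALSE
)
missing_counts <- count_missing_or_blank(toy)
print(missing_counts)
```

Applied to the real data, the call would be `count_missing_or_blank(fraud_detection_transactions)`.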

Convert categorical variables to factors

fraud_data$Transaction_Type <- factor(fraud_data$Transaction_Type)
fraud_data$Fraud_Label <- as.factor(fraud_data$Fraud_Label)
levels(fraud_data$Transaction_Type)

## [1] "ATM Withdrawal" "Bank Transfer"  "Online"         "POS"

Descriptive Statistics

Summarise all variables now that the categorical variables have been converted to factors

summary(fraud_data)

##  Transaction_Amount       Transaction_Type Account_Balance     Risk_Score    
##  Min.   :   0.00    ATM Withdrawal:12453   Min.   :  500.5   Min.   :0.0001  
##  1st Qu.:  28.68    Bank Transfer :12452   1st Qu.:25356.0   1st Qu.:0.2540  
##  Median :  69.66    Online        :12546   Median :50384.4   Median :0.5022  
##  Mean   :  99.41    POS           :12549   Mean   :50294.1   Mean   :0.5016  
##  3rd Qu.: 138.85                           3rd Qu.:75115.1   3rd Qu.:0.7495  
##  Max.   :1174.14                           Max.   :99998.3   Max.   :1.0000  
##  Fraud_Label Daily_Transaction_Count
##  0:33933     Min.   : 1.000         
##  1:16067     1st Qu.: 4.000         
##              Median : 7.000         
##              Mean   : 7.485         
##              3rd Qu.:11.000         
##              Max.   :14.000

sd(fraud_data$Transaction_Amount)

## [1] 98.68729

sd(fraud_data$Risk_Score)

## [1] 0.2877741

Transaction Amount:

- Mean: $99.41 (well above the median, which suggests right skew)
- Median: $69.66 (more representative of a typical transaction)
- Mode: not computed; the histogram in the Visualisation section suggests it lies in the $0-50 range
- Standard deviation: $98.69 (nearly equal to the mean, indicating high variability)
- Range: $0.00 to $1,174.14 (a very wide range with a long upper tail; the maximum is roughly 17 times the median)

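Risk Score:

- Mean: 0.5016 (almost exactly 0.5, the midpoint of the scale)
- Median: 0.5022 (very close to the mean, so the distribution is roughly symmetric)
- Quartiles: 0.2540 (Q1) and 0.7495 (Q3)
- Standard deviation: 0.2878. One standard deviation either side of the mean spans 0.214 to 0.789. The quartiles and the standard deviation (close to 1/√12 ≈ 0.289) are consistent with scores spread roughly evenly over 0 to 1, in which case about 58% of transactions would fall in that interval, not the 68% expected under a normal distribution.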
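The share of scores lying within one standard deviation of the mean can be computed directly instead of being read off a distributional rule of thumb. The sketch below uses simulated uniform scores as a stand-in for `fraud_data$Risk_Score`; the helper name `share_within_one_sd` and the simulated vector are assumptions for illustration.

```r
# Proportion of values within one standard deviation of the mean.
share_within_one_sd <- function(x) {
  mean(abs(x - mean(x)) <= sd(x))
}

set.seed(1)
simulated_scores <- runif(50000)  # stand-in for fraud_data$Risk_Score
uniform_share <- share_within_one_sd(simulated_scores)
normal_share <- share_within_one_sd(rnorm(50000))

# Theoretical reference values: 1 / sqrt(3) (about 0.577) for a uniform
# distribution and about 0.683 for a normal distribution.
print(c(uniform = uniform_share, normal = normal_share))
```

On the real data, `share_within_one_sd(fraud_data$Risk_Score)` would give the observed share.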

Fraud Label Distribution:

- Non-fraud (0): 33,933 transactions (67.87%)
- Fraud (1): 16,067 transactions (32.13%)
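These percentages follow from the class counts in the `summary()` output. A small sketch of the arithmetic, with the counts typed in by hand from that output:

```r
# Class counts copied from the summary() output above.
label_counts <- c("0" = 33933, "1" = 16067)

# Convert counts to percentages of all transactions.
label_pct <- round(100 * prop.table(label_counts), 2)
print(label_pct)
```

With the data loaded, `prop.table(table(fraud_data$Fraud_Label))` gives the same proportions without retyping the counts.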

Descriptive Statistics by Fraud Label

fraud_stats <- fraud_data %>%
  group_by(Fraud_Label) %>%
  summarize(
    Mean_Amount = mean(Transaction_Amount),
    Median_Amount = median(Transaction_Amount),
    SD_Amount = sd(Transaction_Amount),
    Min_Amount = min(Transaction_Amount),
    Max_Amount = max(Transaction_Amount),
    
    Mean_Risk = mean(Risk_Score),
    Median_Risk = median(Risk_Score),
    SD_Risk = sd(Risk_Score),
    Min_Risk = min(Risk_Score),
    Max_Risk = max(Risk_Score),

    Count = n()
  )

print(fraud_stats)

## # A tibble: 2 × 12
##   Fraud_Label Mean_Amount Median_Amount SD_Amount Min_Amount Max_Amount
##   <fct>             <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 0                  99.3          69.4      98.5       0.01      1174.
## 2 1                  99.7          70.1      99.2       0         1005.
## # ℹ 6 more variables: Mean_Risk <dbl>, Median_Risk <dbl>, SD_Risk <dbl>,
## #   Min_Risk <dbl>, Max_Risk <dbl>, Count <int>

Transaction amount, non-fraudulent transactions (Fraud_Label = 0):

- Mean: $99.28
- Median: $69.41
- Standard deviation: $98.46
- Range: $0.01 to $1,174.14

Transaction amount, fraudulent transactions (Fraud_Label = 1):

- Mean: $99.68
- Median: approximately $70.1 (as printed in the table above)
- Standard deviation: $99.16
- Range: $0.00 to $1,005.32

Risk score, non-fraudulent transactions:

- Mean: 0.425
- Median: 0.426

Risk score, fraudulent transactions:

- Mean: 0.663
- Median: 0.806

The mean risk score for fraudulent transactions (0.663) is substantially higher than for non-fraudulent transactions (0.425). The difference of approximately 0.238 is about 0.83 of the overall standard deviation (0.288).
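The comparison with the standard deviation is simple arithmetic, sketched here with the values quoted in the text typed in by hand (the variable names are illustrative):

```r
# Values quoted in the text and the sd() output above.
mean_risk_fraud <- 0.663
mean_risk_non_fraud <- 0.425
sd_risk_overall <- 0.2877741

# Difference in means, raw and in units of the overall standard deviation.
difference <- mean_risk_fraud - mean_risk_non_fraud
standardized_difference <- difference / sd_risk_overall
print(round(c(difference = difference, in_sd_units = standardized_difference), 3))
```

The per-class risk score columns are hidden in the printed tibble; `print(fraud_stats, width = Inf)` displays all columns.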

Visualisation

ggplot(fraud_data, aes(x = Transaction_Amount, fill = Fraud_Label)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  labs(title = "Distribution of Transaction Amounts", 
       x = "Transaction Amount ($)", 
       fill = "Fraudulent?") + 
  scale_fill_manual(values = c("cornflowerblue", "deeppink"))

The histogram shows how frequently transactions of each size occur, split by fraud status. Both the fraudulent and non-fraudulent distributions are strongly right-skewed: most transactions are concentrated at lower amounts (below $200), with a long tail extending toward higher values. Non-fraudulent transactions (blue) clearly outnumber fraudulent ones (pink), consistent with the dataset’s class imbalance (33,933 non-fraud vs. 16,067 fraud cases). The distribution spans from $0 to approximately $1,200, with frequencies falling off rapidly beyond $400. The two groups have very similar shapes, which matches their nearly identical means and standard deviations. There may be a slightly higher proportion of fraudulent transactions in the $50-150 range relative to their overall frequency, but this is hard to judge from overlaid counts when the groups differ in size. Any behavioural reading, for example that fraudsters target moderate amounts that are less likely to trigger scrutiny, is speculative for a synthetic dataset.
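One way to compare the two groups range by range without the class imbalance getting in the way is to tabulate, within each class, the share of transactions per amount range. The sketch below does this in base R on simulated stand-in data; the `toy` data frame and the bin edges are assumptions for illustration, not values from the dataset.

```r
set.seed(42)
# Simulated stand-in for fraud_data (right-skewed amounts, about 32% fraud).
toy <- data.frame(
  Transaction_Amount = rexp(1000, rate = 1 / 100),
  Fraud_Label = factor(rbinom(1000, 1, 0.32))
)

# Amount ranges: [0,50), [50,150), [150,400), [400,Inf].
amount_bins <- cut(toy$Transaction_Amount,
                   breaks = c(0, 50, 150, 400, Inf),
                   include.lowest = TRUE, right = FALSE)

# Row-wise proportions: each class's transactions sum to 1 across the bins.
share_by_class <- prop.table(table(toy$Fraud_Label, amount_bins), margin = 1)
print(round(share_by_class, 3))
```

If the two rows are nearly identical on the real data, the apparent excess of fraud in the $50-150 range is probably a visual effect of the overlay.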

ggplot(fraud_data, aes(x = Fraud_Label, y = Risk_Score, fill = Fraud_Label)) +
  geom_boxplot() + labs(title = "Risk Score Distribution by Fraud Label", 
                        x = "Fraudulent?", 
                        y = "Risk Score", 
                        fill = "Fraudulent?") + 
  scale_x_discrete(labels = c("0" = "Not Fraud", "1" = "Fraud")) + 
  scale_fill_manual(values = c("palegreen", "plum1"))

The box plot shows the risk score distribution for each fraud classification. The median risk score for fraudulent transactions (approximately 0.80) is substantially higher than for non-fraudulent transactions (approximately 0.43), which suggests that the risk score carries useful information for separating the two classes. The fraud category has an IQR from approximately 0.40 to 0.92, while the non-fraud IQR runs from approximately 0.22 to 0.63, so the two boxes overlap considerably despite their distinct medians. Both categories span nearly the entire possible range (0-1) of risk scores, so any single cut-off on the score would misclassify some transactions in both directions. The fraud category has a wider IQR (about 0.52 vs. 0.41), indicating more spread in the risk scores of fraudulent transactions. A box plot alone cannot establish statistical significance; a formal test would be needed to confirm that the difference between the groups is not due to chance.
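The separation between the two groups can be quantified rather than judged by eye. One option, sketched below, is a Wilcoxon rank-sum test together with the AUC derived from its statistic (the probability that a randomly chosen fraudulent transaction has a higher score than a randomly chosen non-fraudulent one). This is not part of the original analysis; the helper name `rank_sum_auc` and the toy vectors are assumptions for illustration.

```r
# Wilcoxon rank-sum test of fraud vs. non-fraud scores, plus the implied AUC.
rank_sum_auc <- function(score, label) {
  pos <- score[label == 1]  # scores of fraudulent transactions
  neg <- score[label == 0]  # scores of non-fraudulent transactions
  test <- wilcox.test(pos, neg)
  list(
    p_value = test$p.value,
    # W counts the (pos, neg) pairs in which the fraud score is higher.
    auc = unname(test$statistic) / (length(pos) * length(neg))
  )
}

# Toy example: 8 of the 9 (fraud, non-fraud) pairs are ordered correctly.
toy_score <- c(0.8, 0.9, 0.7, 0.1, 0.2, 0.75)
toy_label <- c(1, 1, 1, 0, 0, 0)
toy_result <- rank_sum_auc(toy_score, toy_label)
print(toy_result)
```

On the real data the call would be `rank_sum_auc(fraud_data$Risk_Score, fraud_data$Fraud_Label)`; an AUC of 0.5 means no separation and 1 means perfect separation.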

ggplot(fraud_data, aes(x = Transaction_Amount, y = Risk_Score, color = Fraud_Label)) +
  geom_point(alpha = 0.6) +
  labs(title = "Transaction Amount vs. Risk Score",
       x = "Transaction Amount ($)", y = "Risk Score",
       color = "Fraudulent?") +
  scale_color_manual(values = c("palegreen3", "gold2"))

The scatter plot shows the relationship between transaction amount and risk score, with fraudulent transactions highlighted in gold. Most transactions cluster in the lower amount range ($0-400) across the full spectrum of risk scores. Fraudulent transactions (gold) predominantly occupy the upper risk score region (>0.75), particularly at lower transaction amounts. The spread of risk scores appears to narrow as transaction amounts increase, but this may simply reflect how few large transactions there are rather than more consistent risk assessment. Higher-value transactions (>$600) are predominantly non-fraudulent despite occasional high risk scores, so transaction size alone does not appear to predict fraud. The fraudulent transactions appear to concentrate in two bands, at very high risk scores (near 1.0) and at more moderate scores (0.4-0.7), which could point to subtypes of fraud with different risk signatures. With 50,000 overlapping points, these impressions should be checked numerically.
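Patterns read off a dense scatter plot can be checked with a per-range summary: the number of transactions, the spread of risk scores, and the fraud rate in each amount range. The sketch below uses simulated stand-in data in which amount, score, and label are independent by construction; the `toy` data frame and the bin edges are assumptions for illustration.

```r
set.seed(7)
# Simulated stand-in for fraud_data (assumption for illustration).
toy <- data.frame(
  Transaction_Amount = rexp(5000, rate = 1 / 100),
  Risk_Score = runif(5000),
  Fraud_Label = rbinom(5000, 1, 0.32)
)

# Amount ranges: [0,200), [200,400), [400,600), [600,Inf).
toy$amount_bin <- cut(toy$Transaction_Amount,
                      breaks = c(0, 200, 400, 600, Inf), right = FALSE)

# Per-range count, standard deviation of risk score, and fraud rate.
bin_summary <- data.frame(
  n = as.vector(table(toy$amount_bin)),
  sd_risk = as.vector(tapply(toy$Risk_Score, toy$amount_bin, sd)),
  fraud_rate = as.vector(tapply(toy$Fraud_Label, toy$amount_bin, mean)),
  row.names = levels(toy$amount_bin)
)
print(bin_summary)
```

On the real data, `Fraud_Label` is a factor, so the fraud rate would be computed as the mean of `Fraud_Label == 1`. Ranges with very small `n` give unstable estimates, which is one reason the plot can look tighter at high amounts.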