In recent years, payment methods have undergone a significant transformation. Physical cash has gradually given way to cashless payment systems, including debit cards, credit cards, and QR code payments. However, the rise in cashless transactions has been accompanied by a parallel increase in credit card fraud. Fraudulent credit card transactions occur when sellers do not receive payment or cardholders are billed for unauthorised charges. Recent statistics in Malaysia reveal that fraud has caused a staggering RM52.02 billion in cumulative losses, underscoring the severity of the problem. The consequences extend beyond immediate financial losses: fraud damages the reputation of merchants and financial institutions alike, compromises users' credit scores, and ultimately erodes trust in digital payment solutions.
Despite substantial efforts to combat credit card fraud with traditional rule-based systems and manual review, these approaches are labour-intensive and hard to scale. This creates a critical need for automated solutions that can accurately identify fraudulent transactions in real time. Machine learning approaches are promising because they can detect complex patterns and anomalies in fraudulent transactions that traditional techniques might overlook.
With the proliferation of credit card fraud, this project aims to leverage machine learning techniques to develop a credit card fraud detection solution. Two primary objectives are defined as follows:
1. Classification: predict is_fraud, a binary indicator (1 = fraud, 0 = legitimate).
2. Regression: predict amt, the transaction amount.
install.packages(c(
# Data manipulation
"tidyverse", "dplyr", "readr", "stringr", "lubridate", "tidyr",
# Preprocessing
"caret", "geosphere", "fastDummies",
# Machine Learning
"smotefamily", "randomForest", "xgboost", "lightgbm", "e1071", "ranger",
# Modeling framework
"tidymodels", "themis", "broom",
# Evaluation
"PRROC", "pROC", "fastshap",
# Visualization
"ggplot2", "gridExtra", "vip","SHAPforxgboost", "shapviz"
))
# Data manipulation and visualization
library(tidyverse)
library(lubridate)
library(gridExtra)
library(scales)   # provides label_dollar() used in the hourly-amount plot
library(shapviz)
# Machine learning
library(caret)
library(xgboost)
library(lightgbm)
library(randomForest)
library(e1071)
library(glmnet)
library(tidymodels)
library(SHAPforxgboost)
# Evaluation and utilities
library(pROC)
library(PRROC)
library(vip)
library(smotefamily)
library(FNN)
library(geosphere)
library(fastshap)
set.seed(123)
# Load dataset
fraud_raw <- read_csv("fraud_data.csv", show_col_types = FALSE)
# Overview of data
cat("Dataset dimensions:", nrow(fraud_raw), "rows,", ncol(fraud_raw), "columns\n")
## Dataset dimensions: 14446 rows, 15 columns
str(fraud_raw)
## spc_tbl_ [14,446 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ trans_date_trans_time: chr [1:14446] "04-01-2019 00:58" "04-01-2019 15:06" "04-01-2019 22:37" "04-01-2019 23:06" ...
## $ merchant : chr [1:14446] "\"Stokes, Christiansen and Sipes\"" "Predovic Inc" "Wisozk and Sons" "Murray-Smitham" ...
## $ category : chr [1:14446] "grocery_net" "shopping_net" "misc_pos" "grocery_pos" ...
## $ amt : num [1:14446] 14.4 966.1 49.6 295.3 18.2 ...
## $ city : chr [1:14446] "Wales" "Wales" "Wales" "Wales" ...
## $ state : chr [1:14446] "AK" "AK" "AK" "AK" ...
## $ lat : num [1:14446] 64.8 64.8 64.8 64.8 64.8 ...
## $ long : num [1:14446] -166 -166 -166 -166 -166 ...
## $ city_pop : num [1:14446] 145 145 145 145 145 145 145 602 145 145 ...
## $ job : chr [1:14446] "\"Administrator, education\"" "\"Administrator, education\"" "\"Administrator, education\"" "\"Administrator, education\"" ...
## $ dob : chr [1:14446] "09-11-1939" "09-11-1939" "09-11-1939" "09-11-1939" ...
## $ trans_num : chr [1:14446] "a3806e984cec6ac0096d8184c64ad3a1" "a59185fe1b9ccf21323f581d7477573f" "86ba3a888b42cd3925881fa34177b4e0" "3a068fe1d856f0ecedbed33e4b5f4496" ...
## $ merch_lat : num [1:14446] 65.7 65.5 65.3 64.4 65.4 ...
## $ merch_long : num [1:14446] -165 -165 -166 -166 -165 ...
## $ is_fraud : num [1:14446] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. trans_date_trans_time = col_character(),
## .. merchant = col_character(),
## .. category = col_character(),
## .. amt = col_double(),
## .. city = col_character(),
## .. state = col_character(),
## .. lat = col_double(),
## .. long = col_double(),
## .. city_pop = col_double(),
## .. job = col_character(),
## .. dob = col_character(),
## .. trans_num = col_character(),
## .. merch_lat = col_double(),
## .. merch_long = col_double(),
## .. is_fraud = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Create working dataframe with preprocessing
fraud_df <- fraud_raw %>%
distinct() %>%
mutate(
is_fraud = factor(is_fraud, levels = c(0, 1), labels = c("Legit", "Fraud")),
amt = as.numeric(amt),
trans_date = dmy_hm(trans_date_trans_time),
trans_month = month(trans_date, label = TRUE, abbr = TRUE),
trans_hour = hour(trans_date),
trans_weekday = wday(trans_date, label = TRUE)
)
# Check for missing values
cat("\nMissing values per column:\n")
##
## Missing values per column:
print(colSums(is.na(fraud_df)))
## trans_date_trans_time merchant category
## 0 0 0
## amt city state
## 0 0 0
## lat long city_pop
## 0 0 0
## job dob trans_num
## 0 0 0
## merch_lat merch_long is_fraud
## 0 0 2
## trans_date trans_month trans_hour
## 0 0 0
## trans_weekday
## 0
# Class distribution
fraud_counts <- table(fraud_df$is_fraud)
pie(fraud_counts,
main = "Fraud vs Legit Transactions",
col = c("lightgreen", "red"),
labels = paste0(names(fraud_counts), "\n", fraud_counts, " (",
round(prop.table(fraud_counts) * 100, 1), "%)"))
Key Finding: This is an imbalanced dataset: fraud accounts for only 12.4% of transactions, so an imbalanced-data strategy is needed for the classification task.
# Calculate summary for the total dataset
total_stat <- fraud_df %>%
summarise(
Group = "All Transactions",
Min = min(amt),
Median = median(amt),
Mean = mean(amt),
Max = max(amt),
SD = sd(amt),
Count = n()
)
# Calculate summary grouped by fraud status
class_stat <- fraud_df %>%
group_by(is_fraud) %>%
summarise(
Min = min(amt),
Median = median(amt),
Mean = mean(amt),
Max = max(amt),
SD = sd(amt),
Count = n()
) %>%
rename(Group = is_fraud) %>%
mutate(Group = as.character(Group))
# Summary of amt, amt that is legit and amt that is fraud
combined_summary <- bind_rows(total_stat, class_stat)
print(combined_summary)
## # A tibble: 4 × 7
## Group Min Median Mean Max SD Count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 All Transactions 1 51.3 123. 3261. 229. 14383
## 2 Legit 1 46.1 66.8 3261. 114. 12600
## 3 Fraud 1.78 358. 518. 1372. 390. 1781
## 4 <NA> 7.99 13.7 13.7 19.4 8.10 2
Key Finding: Fraudulent transactions have a far higher mean ($518 vs $67) and median ($358 vs $46) amount, with larger variability (standard deviation of $390 vs $114). This indicates that fraudsters generally transact higher amounts in their operations.
# Overall distribution
p1 <- ggplot(fraud_df, aes(x = amt)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
labs(title = "All Transactions", x = "Amount ($)", y = "Frequency") +
theme_minimal()
# Low-value transactions
p2 <- ggplot(fraud_df %>% filter(amt < 500), aes(x = amt, fill = is_fraud)) +
geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
labs(title = "Low-Value (< $500)", x = "Amount ($)", y = "Frequency", fill = "Status") +
theme_minimal()
# High-value transactions
p3 <- ggplot(fraud_df %>% filter(amt >= 500), aes(x = amt, fill = is_fraud)) +
geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
labs(title = "High-Value (≥ $500)", x = "Amount ($)", y = "Frequency", fill = "Status") +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 3)
Key Finding: The large majority of fraud occurs above $500, while legitimate transactions are concentrated below $500, again indicating that fraudsters tend to transact higher amounts.
# Calculate state-level metrics
state_summary <- fraud_df %>%
group_by(state, is_fraud) %>%
summarise(total_amt = sum(amt, na.rm = TRUE), .groups = "drop") %>%
pivot_wider(names_from = is_fraud, values_from = total_amt, values_fill = 0) %>%
mutate(
total = Legit + Fraud,
fraud_rate = Fraud / total * 100
) %>%
arrange(desc(fraud_rate))
# Top 10 states by fraud rate
top_fraud_states <- head(state_summary, 10)
ggplot(top_fraud_states, aes(x = reorder(state, fraud_rate), y = fraud_rate)) +
geom_col(fill = "red", alpha = 0.7) +
coord_flip() +
labs(title = "Top 10 States by Fraud Loss Rate",
x = "State", y = "Fraud Loss Rate (%)") +
theme_minimal()
Key Finding: The graph shows that AK has the highest fraud loss rate at around 75%, followed by NE at around 63%, indicating that geographic patterns matter.
# Top 10 cities by fraud amount
city_fraud <- fraud_df %>%
filter(is_fraud == "Fraud") %>%
group_by(city) %>%
summarise(fraud_amt = sum(amt, na.rm = TRUE)) %>%
arrange(desc(fraud_amt)) %>%
head(10)
ggplot(city_fraud, aes(x = reorder(city, fraud_amt), y = fraud_amt)) +
geom_col(fill = "red", alpha = 0.7) +
coord_flip() +
labs(title = "Top 10 Cities by Fraud Amount",
x = "City", y = "Total Fraud Amount ($)") +
theme_minimal()
Key Finding: Seattle leads in total fraud amount, followed by Aurora and Albuquerque, showing that fraud losses also vary substantially across cities.
# Transaction frequency by category
p1 <- fraud_df %>%
count(category, is_fraud) %>%
ggplot(aes(x = reorder(category, n), y = n, fill = is_fraud)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
coord_flip() +
labs(title = "Transaction Frequency by Category",
x = "Category", y = "Count", fill = "Status") +
theme_minimal()
# Total amount by category
p2 <- fraud_df %>%
group_by(category, is_fraud) %>%
summarise(total = sum(amt, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = reorder(category, total), y = total, fill = is_fraud)) +
geom_col(position = "stack") +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
coord_flip() +
labs(title = "Total Amount by Category",
x = "Category", y = "Total Amount ($)", fill = "Status") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
Key Finding: The graphs reveal an interesting contrast: grocery_pos has the highest fraud frequency but only the second-highest fraud value, while shopping_net has the highest fraud amount but ranks third by frequency. Fraud frequency and fraud value therefore do not move in lockstep.
hourly_summary <- fraud_df %>%
group_by(trans_hour, is_fraud) %>%
summarise(count = n(),total_amt = sum(amt, na.rm = TRUE),.groups = "drop")
# Hourly trends - transaction count
p1 <- ggplot(fraud_df, aes(x = trans_hour, fill = is_fraud)) +
geom_histogram(binwidth = 1, position = "stack", color = "white", linewidth = 0.1) +
coord_flip() +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
scale_x_reverse(breaks = seq(0, 23, by = 1)) +
labs(
title = "Hourly Transaction Count",
x = "Hour of Day (00:00 - 23:00)",
y = "Number of Transactions",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "bottom")
# Hourly trends - transaction amount (reuses hourly_summary computed above)
p2 <- hourly_summary %>%
  ggplot(aes(x = trans_hour, y = total_amt, fill = is_fraud)) +
geom_col(position = "stack", color = "white", linewidth = 0.1) +
coord_flip() +
scale_fill_manual(values = c("Legit" = "steelblue", "Fraud" = "red")) +
scale_x_reverse(breaks = seq(0, 23, by = 1)) +
scale_y_continuous(labels = label_dollar(scale = 1e-3, suffix = "K")) +
labs(
title = "Hourly Total Transaction Amount",
subtitle = "Financial volume ($) per hour",
x = "Hour of Day (00:00 - 23:00)",
y = "Total Amount ($)",
fill = "Status"
) +
theme_minimal() +
theme(legend.position = "bottom")
grid.arrange(p1, p2, ncol = 2)
Key Finding: By frequency, fraud is concentrated in the late-night and early-morning hours. By amount, transaction counts are comparatively low between 10 pm and midnight, yet the fraud amount in those hours is disproportionately high.
# Remove duplicates and handle missing values
fraud_clean <- fraud_raw %>%
distinct() %>%
filter(complete.cases(.))
# Parse datetime and extract temporal features and age addition
fraud_clean <- fraud_clean %>%
mutate(
trans_date = dmy_hm(trans_date_trans_time),
trans_year = year(trans_date),
trans_month = month(trans_date),
trans_day = day(trans_date),
trans_hour = hour(trans_date),
trans_weekday = wday(trans_date),
dob_date = dmy(dob)
)
cat("Cleaned dataset:", nrow(fraud_clean), "rows\n")
## Cleaned dataset: 14381 rows
Description: We first remove duplicate rows and rows with missing values, then parse trans_date_trans_time with dmy_hm() and extract temporal features (year, month, day, hour, weekday), along with the cardholder's date of birth.
# Encode hour and weekday as cyclical features
fraud_clean <- fraud_clean %>%
mutate(
hour_sin = sin(2 * pi * trans_hour / 24),
hour_cos = cos(2 * pi * trans_hour / 24),
weekday_sin = sin(2 * pi * trans_weekday / 7),
weekday_cos = cos(2 * pi * trans_weekday / 7)
)
# Add age features
fraud_clean <- fraud_clean %>%
mutate(
age = year(trans_date) - year(dob_date)
)
Description: Hour and weekday are encoded as sine/cosine pairs to preserve the circular nature of time, so that hour 23 sits next to hour 0. Age is approximated as the difference in calendar years between the transaction date and the date of birth.
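To see why the cyclical encoding matters, consider a minimal sketch (hypothetical hours, not drawn from the dataset): 11 pm and midnight are one hour apart, yet 23 units apart on the raw integer scale, while the (sin, cos) encoding keeps them close.
# Minimal sketch: raw vs cyclical distance between hour 23 and hour 0
h <- c(0, 23)
raw_gap <- abs(h[2] - h[1])                    # 23 on the raw scale
enc <- cbind(sin(2 * pi * h / 24), cos(2 * pi * h / 24))
cyc_gap <- sqrt(sum((enc[1, ] - enc[2, ])^2))  # ~0.26 in encoded space
c(raw = raw_gap, cyclical = round(cyc_gap, 2))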
# Calculate Haversine distance between cardholder and merchant locations
# (distHaversine is vectorised over rows, so rowwise() is unnecessary)
fraud_clean <- fraud_clean %>%
  mutate(
    distance_km = distHaversine(
      cbind(long, lat),
      cbind(merch_long, merch_lat)
    ) / 1000
  )
cat("Distance feature created. Range:",
round(min(fraud_clean$distance_km, na.rm = TRUE), 2), "to",
round(max(fraud_clean$distance_km, na.rm = TRUE), 2), "km\n")
## Distance feature created. Range: 0.64 to 143.72 km
Description: The Haversine formula measures the great-circle distance between the cardholder's home location and the merchant; unusually large distances may indicate abnormal transaction patterns.
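For reference, a sketch of the Haversine formula itself, applied to one coordinate pair from the str() output above and using geosphere's default Earth radius (6378137 m); it should agree with distHaversine() up to rounding:
# Sketch of the Haversine formula underlying distHaversine()
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a))) / 1000  # metres -> km
}
haversine_km(-166, 64.8, -165, 65.7)  # cardholder vs merchant from the sample rows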
# Add fraud rate per merchant
merchant_fraud_rates <- fraud_clean %>%
group_by(merchant) %>%
summarise(
merchant_fraud_rate = mean(is_fraud, na.rm = TRUE),
merchant_trans_count = n()
)
fraud_clean <- fraud_clean %>%
left_join(merchant_fraud_rates, by = "merchant")
# Add fraud rate per job
job_fraud_rates <- fraud_clean %>%
group_by(job) %>%
summarise(
job_fraud_rate = mean(is_fraud, na.rm = TRUE),
job_trans_count = n()
)
fraud_clean <- fraud_clean %>%
left_join(job_fraud_rates, by = "job")
Description: Fraud-rate features capture the historical fraud patterns associated with specific merchants and job categories, providing contextual risk indicators. Note that these rates are computed on the full dataset before the train/test split, which leaks test-set labels into the features; a stricter pipeline would derive them from the training split only, as sketched below.
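A minimal leakage-safe sketch, assuming the split happens before encoding; train_raw and test_raw are hypothetical names for the two halves, not objects from the pipeline above:
# Hypothetical leakage-safe variant: rates derived from the training split only
merchant_rates_train <- train_raw %>%
  group_by(merchant) %>%
  summarise(merchant_fraud_rate = mean(is_fraud), merchant_trans_count = n())
global_rate <- mean(train_raw$is_fraud)
test_raw <- test_raw %>%
  left_join(merchant_rates_train, by = "merchant") %>%
  mutate(
    # merchants unseen in training fall back to the global training fraud rate
    merchant_fraud_rate  = coalesce(merchant_fraud_rate, global_rate),
    merchant_trans_count = coalesce(merchant_trans_count, 0L)
  )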
# Select features for modeling
model_data <- fraud_clean %>%
select(
# Target variable
is_fraud,
# Numerical features
amt, distance_km, merchant_fraud_rate, merchant_trans_count,
job_fraud_rate, job_trans_count, city_pop, age,
# Cyclical features
hour_sin, hour_cos, weekday_sin, weekday_cos,
# Categorical features
category, state
)
Description: Out of the data's 32 columns, we select 14 key predictors plus the target, grouped into numerical features, cyclical temporal features, and categorical features.
# One-hot encode categorical variables
dummy_vars <- dummyVars(~ category + state, data = model_data, fullRank = TRUE)
encoded_features <- predict(dummy_vars, newdata = model_data)
Description: Categorical variables are one-hot encoded to convert them into numerical data, since most models learn better from numeric inputs and some (e.g., XGBoost and glmnet) strictly require numeric matrices.
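A toy illustration (hypothetical values, not project data) of what dummyVars() with fullRank = TRUE produces: one level becomes the reference and the rest become 0/1 indicator columns.
# Toy example: fullRank one-hot encoding drops one reference level
toy <- data.frame(category = factor(c("grocery_pos", "misc_pos", "shopping_net")))
dv  <- dummyVars(~ category, data = toy, fullRank = TRUE)
predict(dv, newdata = toy)  # k - 1 = 2 indicator columns; grocery_pos is the reference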
model_features <- model_data %>%
select(-category, -state) %>%
bind_cols(as.data.frame(encoded_features))
cat("Final feature count:", ncol(model_features) - 1, "features\n")
## Final feature count: 37 features
# Create stratified train-test split
train_indices <- createDataPartition(
y = model_features$is_fraud,
p = 0.7,
list = FALSE
)
train_data <- model_features[train_indices, ]
test_data <- model_features[-train_indices, ]
cat("Training set:", nrow(train_data), "rows\n")
## Training set: 10067 rows
cat("Test set:", nrow(test_data), "rows\n")
## Test set: 4314 rows
cat("Train fraud rate:", round(mean(train_data$is_fraud) * 100, 2), "%\n")
## Train fraud rate: 12.49 %
cat("Test fraud rate:", round(mean(test_data$is_fraud) * 100, 2), "%\n")
## Test fraud rate: 12.15 %
Description: Stratified sampling is chosen for better representation of the actual label distribution in both the train and test sets, which allows for more accurate performance evaluation.
# 1. Prepare data for SMOTE (ensure target is numeric for the SMOTE function)
train_X <- train_data %>% select(-is_fraud)
train_y <- as.numeric(as.character(train_data$is_fraud))
# 2. Apply SMOTE to balance classes
# K=5 looks at 5 nearest neighbors; dup_size=5 creates 5 synthetic samples for each fraud case
smote_result <- SMOTE(
X = train_X,
target = train_y,
K = 5,
dup_size = 5
)
# 3. Create balanced training set and FIX FACTOR NAMES IMMEDIATELY
# This prevents the "X0, X1" error in caret later
train_smote <- smote_result$data %>%
rename(is_fraud = class) %>%
mutate(is_fraud = factor(is_fraud, levels = c(0, 1), labels = c("Legit", "Fraud")))
# 4. Fix test set names to match
test_data <- test_data %>%
mutate(is_fraud = factor(is_fraud, levels = c(0, 1), labels = c("Legit", "Fraud")))
cat("SMOTE applied and Factor Labels fixed:\n")
## SMOTE applied and Factor Labels fixed:
cat(" New Class Levels:", levels(train_smote$is_fraud), "\n")
## New Class Levels: Legit Fraud
cat(" SMOTE fraud rate:", round(mean(train_smote$is_fraud == "Fraud") * 100, 2), "%\n")
## SMOTE fraud rate: 46.12 %
Description: SMOTE (Synthetic Minority Over-sampling Technique) upsamples the fraud class by interpolating between existing fraud cases and their nearest fraud neighbours. It is a common and effective strategy for imbalanced data, giving the model a more balanced class distribution to learn from.
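The core of SMOTE is simple interpolation; a sketch on two hypothetical fraud points (illustrative values only, not drawn from the data):
# Sketch of SMOTE's interpolation step: a synthetic sample lies at a random
# point on the segment between a minority case and a nearest minority neighbour
x_i  <- c(amt = 500, distance_km = 20)  # a hypothetical fraud case
x_nn <- c(amt = 650, distance_km = 35)  # one of its K nearest fraud neighbours
u     <- runif(1)                       # random weight in [0, 1]
x_new <- x_i + u * (x_nn - x_i)         # new synthetic fraud sample
x_new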
# Identify numeric features
numeric_features <- c("amt", "distance_km", "merchant_fraud_rate", "merchant_trans_count",
"job_fraud_rate", "job_trans_count", "city_pop", "age",
"hour_sin", "hour_cos", "weekday_sin", "weekday_cos")
# 1. Fit scaler ONLY on the training data to prevent data leakage
scaler <- preProcess(train_smote[, numeric_features], method = c("center", "scale"))
# 2. Transform the datasets
train_smote_scaled <- train_smote
train_smote_scaled[, numeric_features] <- predict(scaler, train_smote[, numeric_features])
test_scaled <- test_data
test_scaled[, numeric_features] <- predict(scaler, test_data[, numeric_features])
# 3. Final Safety Check for NAs
# This ensures the na.fail error doesn't stop your training
if(any(is.na(train_smote_scaled))) {
train_smote_scaled <- na.omit(train_smote_scaled)
cat("Dropped rows with NAs in training set.\n")
}
cat("Feature scaling completed. Training samples:", nrow(train_smote_scaled), "\n")
## Feature scaling completed. Training samples: 16352
# Verify naming convention for caret
print(head(train_smote_scaled$is_fraud))
## [1] Fraud Fraud Fraud Fraud Fraud Fraud
## Levels: Legit Fraud
Description: Scaling brings all numerical variables into a similar range, preventing features with large magnitudes from dominating those with small ones. This mainly benefits logistic regression; tree-based models are insensitive to monotone rescaling, but a single scaler keeps the pipeline uniform.
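Since no NA rows were dropped above (16352 rows before and after), the scaler is equivalent to a z-score using training-set statistics; a quick sketch verifying this for amt:
# Sketch: center/scale preProcess == z-score with training means and SDs
all.equal(
  train_smote_scaled$amt,
  (train_smote$amt - mean(train_smote$amt)) / sd(train_smote$amt)
)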
Goal: Develop binary classification model(s) that accurately identify whether a transaction is fraudulent or legitimate.
# Prepare matrices for XGBoost
# Use the updated text labels "Fraud" to create the numeric 0/1 vector
X_train_mat <- as.matrix(train_smote_scaled %>% select(-is_fraud))
y_train <- ifelse(train_smote_scaled$is_fraud == "Fraud", 1, 0)
X_test_mat <- as.matrix(test_scaled %>% select(-is_fraud))
y_test <- ifelse(test_scaled$is_fraud == "Fraud", 1, 0)
# Safety check: Ensure y_train contains both 0 and 1
cat("Distribution of labels in y_train:\n")
## Distribution of labels in y_train:
print(table(y_train))
## y_train
## 0 1
## 8810 7542
# Convert y_test to factor for confusion matrix (matches the "Legit"/"Fraud" format)
y_test_factor <- factor(y_test, levels = c(0, 1), labels = c("Legit", "Fraud"))
# Create DMatrices
dtrain <- xgb.DMatrix(data = X_train_mat, label = y_train)
dtest <- xgb.DMatrix(data = X_test_mat, label = y_test)
evaluation_results <- data.frame()
all_predictions <- list()
xgb <- xgb.train(
params = list(objective = "binary:logistic", eval_metric = "logloss",
max_depth = 4, eta = 0.1),
data = dtrain, nrounds = 100, verbose = 0
)
# Predict
xgb_pred <- predict(xgb, dtest)
# Use text labels "Fraud" and "Legit"
xgb_pred_class <- factor(ifelse(xgb_pred > 0.5, "Fraud", "Legit"), levels = c("Legit", "Fraud"))
# Change positive to "Fraud"
xgb_cm <- confusionMatrix(xgb_pred_class, y_test_factor, positive = "Fraud")
evaluation_results <- rbind(evaluation_results, data.frame(
Model = "XGBoost",
ROC_AUC = auc(roc(y_test, xgb_pred, quiet = TRUE)),
Accuracy = as.numeric(xgb_cm$overall["Accuracy"]),
Precision = as.numeric(xgb_cm$byClass["Pos Pred Value"]),
Recall = as.numeric(xgb_cm$byClass["Sensitivity"]),
F1_Score = as.numeric(xgb_cm$byClass["F1"])
))
all_predictions[["XGBoost"]] <- xgb_pred
rf <- randomForest(is_fraud ~ ., data = train_smote_scaled, ntree = 100)
# Get probabilities for the "Fraud" column
rf_pred <- predict(rf, test_scaled, type = "prob")[, "Fraud"]
# Use text labels
rf_pred_class <- factor(ifelse(rf_pred > 0.5, "Fraud", "Legit"), levels = c("Legit", "Fraud"))
# Change positive to "Fraud"
rf_cm <- confusionMatrix(rf_pred_class, y_test_factor, positive = "Fraud")
evaluation_results <- rbind(evaluation_results, data.frame(
Model = "Random Forest",
ROC_AUC = auc(roc(y_test, rf_pred, quiet = TRUE)),
Accuracy = as.numeric(rf_cm$overall["Accuracy"]),
Precision = as.numeric(rf_cm$byClass["Pos Pred Value"]),
Recall = as.numeric(rf_cm$byClass["Sensitivity"]),
F1_Score = as.numeric(rf_cm$byClass["F1"])
))
all_predictions[["Random Forest"]] <- rf_pred
lr <- glm(is_fraud ~ ., data = train_smote_scaled, family = binomial)
lr_pred <- predict(lr, test_scaled, type = "response")
# Use text labels
lr_pred_class <- factor(ifelse(lr_pred > 0.5, "Fraud", "Legit"), levels = c("Legit", "Fraud"))
# Change positive to "Fraud"
lr_cm <- confusionMatrix(lr_pred_class, y_test_factor, positive = "Fraud")
evaluation_results <- rbind(evaluation_results, data.frame(
Model = "Logistic Regression",
ROC_AUC = auc(roc(y_test, lr_pred, quiet = TRUE)),
Accuracy = as.numeric(lr_cm$overall["Accuracy"]),
Precision = as.numeric(lr_cm$byClass["Pos Pred Value"]),
Recall = as.numeric(lr_cm$byClass["Sensitivity"]),
F1_Score = as.numeric(lr_cm$byClass["F1"])
))
all_predictions[["Logistic Regression"]] <- lr_pred
evaluation_results <- evaluation_results %>%
mutate(ROC_AUC = as.numeric(ROC_AUC),
Accuracy = as.numeric(Accuracy),
Precision = as.numeric(Precision),
Recall = as.numeric(Recall),
F1_Score = as.numeric(F1_Score))
print(evaluation_results)
## Model ROC_AUC Accuracy Precision Recall F1_Score
## 1 XGBoost 0.9916735 0.9772833 0.9243028 0.8854962 0.9044834
## 2 Random Forest 0.9879718 0.9747334 0.9368421 0.8492366 0.8908909
## 3 Logistic Regression 0.9553616 0.9063514 0.5797872 0.8320611 0.6833856
# Reshape data to long format for grouped bars
class_results_long <- evaluation_results %>%
select(Model, ROC_AUC, Accuracy, Precision, Recall, F1_Score) %>%
pivot_longer(cols = c(ROC_AUC, Accuracy, Precision, Recall, F1_Score),
names_to = "Metric",
values_to = "Value")
# Create clustered bar chart
ggplot(class_results_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Comparison of Classification Evaluation Metrics",
x = "Metric", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key Finding: Plotting the evaluation metrics shows that XGBoost performs best on 4 of the 5 metrics (Accuracy, F1 Score, Recall, and ROC AUC), while Random Forest achieves the highest Precision. Overall, XGBoost is the best-performing model for the classification task, so we tuned its hyperparameters to further optimize its ability to distinguish legitimate from fraudulent transactions.
Using the tidymodels framework, we conducted a Grid Search with 5-fold Cross-Validation to refine the model:
- trees: the total number of boosting iterations.
- tree_depth: the maximum depth (complexity) of the individual decision trees.
- learn_rate: the speed at which the model adapts to errors from previous trees.
# 1. Define model specification (stays the same)
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# 2. Define grid (FIXED: Added learn_rate)
xgb_grid <- grid_regular(
trees(range = c(100, 200)),
tree_depth(range = c(3, 9)),
learn_rate(range = c(-2, -1)), # This uses 10^-2 (0.01) to 10^-1 (0.1)
levels = 5
)
# 3. Create cross-validation folds
set.seed(123)
cv_folds <- vfold_cv(train_smote_scaled, v = 5, strata = is_fraud)
# 4. Tune the grid
xgb_results <- tune_grid(
object = xgb_spec,
preprocessor = is_fraud ~ .,
resamples = cv_folds,
grid = xgb_grid,
metrics = metric_set(roc_auc, f_meas, pr_auc) # Added more metrics for fraud
)
# 5. View the top 5 results
show_best(xgb_results, metric = "roc_auc")
## # A tibble: 5 × 9
## trees tree_depth learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 175 7 0.1 roc_auc binary 0.999 5 0.0000534 pre0_mod…
## 2 150 7 0.1 roc_auc binary 0.999 5 0.0000538 pre0_mod…
## 3 175 9 0.1 roc_auc binary 0.999 5 0.0000632 pre0_mod…
## 4 150 9 0.1 roc_auc binary 0.999 5 0.0000578 pre0_mod…
## 5 125 7 0.1 roc_auc binary 0.999 5 0.0000497 pre0_mod…
Key Finding: By testing 125 combinations of these parameters (5 levels each for trees, tree_depth, and learn_rate), we identified an optimal configuration (trees = 175, tree_depth = 7, learn_rate = 0.1) with a mean cross-validated ROC AUC of about 0.999.
# 1. Select the best parameters based on ROC AUC
best_params <- select_best(xgb_results, metric = "roc_auc")
# 2. Finalize the specification
final_xgb_spec <- finalize_model(xgb_spec, best_params)
# 3. Fit the model to the ENTIRE training set (SMOTE version)
final_model_fit <- final_xgb_spec %>%
fit(is_fraud ~ ., data = train_smote_scaled)
We lock in the parameters trees = 175, tree_depth = 7, and learn_rate = 0.1.
# 4. Generate class predictions (Legit/Fraud)
final_preds_class <- predict(final_model_fit, new_data = test_scaled)
# 5. Generate probability predictions
final_preds_prob <- predict(final_model_fit, new_data = test_scaled, type = "prob")
# 6. Combine for evaluation
final_results <- test_scaled %>%
select(is_fraud) %>%
bind_cols(final_preds_class, final_preds_prob)
# Using caret for the final report
conf_mat_final <- confusionMatrix(
final_results$.pred_class,
final_results$is_fraud,
positive = "Fraud"
)
print(conf_mat_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Legit Fraud
## Legit 3762 37
## Fraud 28 487
##
## Accuracy : 0.9849
## 95% CI : (0.9808, 0.9884)
## No Information Rate : 0.8785
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9289
##
## Mcnemar's Test P-Value : 0.3211
##
## Sensitivity : 0.9294
## Specificity : 0.9926
## Pos Pred Value : 0.9456
## Neg Pred Value : 0.9903
## Prevalence : 0.1215
## Detection Rate : 0.1129
## Detection Prevalence : 0.1194
## Balanced Accuracy : 0.9610
##
## 'Positive' Class : Fraud
##
# 1. Reuse the confusion matrix computed above
# 'final_results' contains the predicted class and the probabilities
cm_table_tuned <- as.data.frame(conf_mat_final$table)
# Plot 1: Confusion Matrix Heatmap
p1 <- ggplot(cm_table_tuned, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", size = 6) +
scale_fill_gradient(low = "steelblue", high = "darkred") +
labs(title = "Tuned XGBoost: Confusion Matrix",
x = "Actual Status", y = "Predicted Status") +
theme_minimal()
# 2. Extract Probabilities for ROC and PR
# Note: final_results$.pred_Fraud contains the probability of the transaction being fraud
probs_tuned <- final_results$.pred_Fraud
actuals_tuned <- ifelse(final_results$is_fraud == "Fraud", 1, 0)
# Plot 2: ROC Curve
roc_obj_tuned <- roc(actuals_tuned, probs_tuned)
p2 <- ggroc(roc_obj_tuned, colour = "blue", linewidth = 1.2) +
labs(title = paste("Tuned ROC (AUC =", round(auc(roc_obj_tuned), 4), ")")) +
theme_minimal()
# Plot 3: Precision-Recall Curve
pr_tuned <- pr.curve(scores.class0 = probs_tuned[actuals_tuned == 1],
scores.class1 = probs_tuned[actuals_tuned == 0],
curve = TRUE)
pr_df_tuned <- data.frame(Recall = pr_tuned$curve[,1], Precision = pr_tuned$curve[,2])
p3 <- ggplot(pr_df_tuned, aes(x = Recall, y = Precision)) +
geom_line(color = "darkgreen", linewidth = 1.2) +
labs(title = "Tuned Precision-Recall Curve",
x = "Recall", y = "Precision") +
theme_minimal()
# Combined Multi-Panel View
grid.arrange(p1, p2, p3, ncol = 3)
Key Finding: The final model achieved a ROC AUC of 0.9963, indicating strong ability to distinguish legitimate from fraudulent transactions. Operationally, it maintains a good balance between security and user experience, catching roughly 93% of all fraud (sensitivity 0.9294) while ensuring about 94.6% of flags are correct (precision 0.9456), thereby protecting financial assets while minimizing unnecessary friction for legitimate customers.
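The 0.5 cutoff used above is a default, not a requirement; a sketch (not part of the original run) of sweeping the decision threshold to trade recall against precision:
# Sketch: precision/recall trade-off across candidate decision thresholds
sweep <- lapply(seq(0.1, 0.9, by = 0.1), function(t) {
  pred <- factor(ifelse(probs_tuned > t, "Fraud", "Legit"),
                 levels = c("Legit", "Fraud"))
  cm <- confusionMatrix(pred, final_results$is_fraud, positive = "Fraud")
  data.frame(threshold = t,
             precision = as.numeric(cm$byClass["Pos Pred Value"]),
             recall    = as.numeric(cm$byClass["Sensitivity"]))
})
do.call(rbind, sweep)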
library(vip)
# 1. Extract the underlying engine fit and get importance
# This retrieves the importance scores from the tuned XGBoost engine
xgb_imp_data <- final_model_fit %>%
extract_fit_engine() %>%
vi()
# 2. View the importance values (Gain, Cover, Frequency)
print(xgb_imp_data)
## # A tibble: 37 × 2
## Variable Importance
## <chr> <dbl>
## 1 amt 0.614
## 2 hour_cos 0.109
## 3 categorygas_transport 0.0619
## 4 merchant_fraud_rate 0.0354
## 5 job_fraud_rate 0.0221
## 6 categorygrocery_pos 0.0192
## 7 hour_sin 0.0178
## 8 categorykids_pets 0.0154
## 9 merchant_trans_count 0.0107
## 10 age 0.0101
## # ℹ 27 more rows
# 3. Create a polished ggplot2-based importance plot
p_imp <- final_model_fit %>%
extract_fit_engine() %>%
vip(num_features = 20, geom = "point", aesthetics = list(color = "darkred", size = 3)) +
labs(title = "Tuned XGBoost - Top 20 Feature Importance",
subtitle = "Based on Gain (Contribution to Model Accuracy)",
x = "Features",
y = "Importance Score") +
theme_minimal()
print(p_imp)
Key Finding: The feature amt has by far the highest predictive power, roughly five to six times that of the second-ranked feature (0.614 vs 0.109 gain), so the model relies most heavily on transaction amount. Several of the remaining top features are temporal (hour_cos, hour_sin, weekday_cos), confirming that the time at which a transaction occurs contributes meaningfully to the model's predictions.
# 1. Ensure the matrix is numeric and has names
X_test_mat <- as.matrix(test_scaled %>% select(-is_fraud))
colnames(X_test_mat) <- colnames(test_scaled %>% select(-is_fraud))
# 2. Extract engine and calculate values
final_engine <- extract_fit_engine(final_model_fit)
shap_results <- shap.values(xgb_model = final_engine, X_train = X_test_mat)
# 3. Use the top 10 features to simplify the plot and avoid internal naming conflicts
shap_long <- shap.prep(xgb_model = final_engine, X_train = X_test_mat, top_n = 10)
# 4. Generate the plot
shap.plot.summary(shap_long)
Key Finding: High transaction amounts (yellow/orange dots) are strongly associated with high SHAP values, meaning larger transactions substantially increase the predicted probability of fraud.
- merchant_fraud_rate: the second most impactful feature. Transactions at merchants with historically high fraud rates strongly push the model toward a fraud prediction.
- hour_cos and hour_sin: these features show a wide spread of impact. The purple clusters on the right for hour_cos suggest that specific times of day (likely late-night or early-morning hours) are strong indicators of fraudulent activity.
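Beyond the global summary, SHAP values also explain individual decisions. A sketch using the shapviz package (loaded earlier) for the first test transaction; the row index is arbitrary:
# Sketch: local SHAP explanation for a single test transaction
shv <- shapviz(final_engine, X_pred = X_test_mat)
sv_waterfall(shv, row_id = 1) +
  labs(title = "SHAP Waterfall - First Test Transaction")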
Goal: Develop regressor models that predict the transaction amount.
# Use original train/test split without SMOTE
train_reg <- train_data
test_reg <- test_data
# Standardize features for regression
train_reg_scaled <- train_reg
train_reg_scaled[, numeric_features] <- predict(scaler, train_reg[, numeric_features])
test_reg_scaled <- test_reg
test_reg_scaled[, numeric_features] <- predict(scaler, test_reg[, numeric_features])
# Remove is_fraud and amt columns for regression features
X_train_reg <- train_reg_scaled %>% select(-is_fraud, -amt)
y_train_reg <- train_reg_scaled$amt
X_test_reg <- test_reg_scaled %>% select(-is_fraud, -amt)
y_test_reg <- test_reg_scaled$amt
# Convert to matrices for glmnet and xgboost
X_train_reg_mat <- as.matrix(X_train_reg)
X_test_reg_mat <- as.matrix(X_test_reg)
cat("Regression data prepared:\n")
## Regression data prepared:
cat(" Training samples:", nrow(X_train_reg), "\n")
## Training samples: 10067
cat(" Test samples:", nrow(X_test_reg), "\n")
## Test samples: 4314
reg_evaluation_results <- data.frame()
all_reg_predictions <- list()
elastic_reg <- cv.glmnet(X_train_reg_mat, y_train_reg, alpha = 0.5)
elastic_reg_pred <- predict(elastic_reg, X_test_reg_mat, s = "lambda.min")
elastic_reg_pred <- as.vector(elastic_reg_pred)
elastic_reg_rmse <- sqrt(mean((elastic_reg_pred - y_test_reg)^2))
elastic_reg_mae <- mean(abs(elastic_reg_pred - y_test_reg))
elastic_reg_r2 <- cor(elastic_reg_pred, y_test_reg)^2
reg_evaluation_results <- rbind(reg_evaluation_results, data.frame(
Model = "Elastic Net",
RMSE = elastic_reg_rmse,
MAE = elastic_reg_mae,
R2 = elastic_reg_r2
))
all_reg_predictions[["Elastic Net"]] <- elastic_reg_pred
rf_reg <- randomForest(amt ~ ., data = train_reg_scaled %>% select(-is_fraud),
ntree = 100)
rf_reg_pred <- predict(rf_reg, test_reg_scaled)
rf_reg_rmse <- sqrt(mean((rf_reg_pred - y_test_reg)^2))
rf_reg_mae <- mean(abs(rf_reg_pred - y_test_reg))
rf_reg_r2 <- cor(rf_reg_pred, y_test_reg)^2
reg_evaluation_results <- rbind(reg_evaluation_results, data.frame(
Model = "Random Forest",
RMSE = rf_reg_rmse,
MAE = rf_reg_mae,
R2 = rf_reg_r2
))
all_reg_predictions[["Random Forest"]] <- rf_reg_pred
X_train_reg_mat <- as.matrix(X_train_reg)
X_test_reg_mat <- as.matrix(X_test_reg)
dtrain_reg <- xgb.DMatrix(data = X_train_reg_mat, label = y_train_reg)
dtest_reg <- xgb.DMatrix(data = X_test_reg_mat, label = y_test_reg)
xgb_reg_base <- xgb.train(
params = list(objective = "reg:squarederror", eval_metric = "rmse",
max_depth = 4, eta = 0.1),
data = dtrain_reg, nrounds = 100, verbose = 0
)
xgb_reg_pred <- predict(xgb_reg_base, dtest_reg)
xgb_reg_rmse <- sqrt(mean((xgb_reg_pred - y_test_reg)^2))
xgb_reg_mae <- mean(abs(xgb_reg_pred - y_test_reg))
xgb_reg_r2 <- cor(xgb_reg_pred, y_test_reg)^2
reg_evaluation_results <- rbind(reg_evaluation_results, data.frame(
Model = "XGBoost",
RMSE = xgb_reg_rmse,
MAE = xgb_reg_mae,
R2 = xgb_reg_r2
))
all_reg_predictions[["XGBoost"]] <- xgb_reg_pred
print(reg_evaluation_results)
## Model RMSE MAE R2
## 1 Elastic Net 0.5367699 0.3347638 0.2825052
## 2 Random Forest 0.4170278 0.2275935 0.5699191
## 3 XGBoost 0.4251603 0.2322232 0.5514704
# Reshape data to long format for grouped bars
reg_results_long <- reg_evaluation_results %>%
select(Model, RMSE, MAE, R2) %>%
pivot_longer(cols = c(RMSE, MAE, R2),
names_to = "Metric",
values_to = "Value")
# Create clustered bar chart
ggplot(reg_results_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Comparison of Regression Evaluation Metrics",
x = "Metric", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key Finding: Plotting the evaluation metrics shows that Random Forest performs best on all three metrics (lowest RMSE, lowest MAE, highest R²), making it the best regression model overall. XGBoost is a close second, while Elastic Net performs worst: as a linear model, it cannot capture the complex non-linear relationships in the data.
Similar to the classification task, hyperparameter tuning was performed on the Random Forest model using the tidymodels framework with 5-fold cross-validation and 1000 trees. The objective was to identify the optimal combination of the number of predictors randomly sampled at each split (mtry) and the minimum number of observations required in terminal (leaf) nodes (min_n).
#1. Preparing the data
train_reg_tidy <- train_reg_scaled %>% select(-is_fraud)
test_reg_tidy <- test_reg_scaled %>% select(-is_fraud)
#2. Defining the tuning parameters
rf_spec <- rand_forest(
mtry = tune(), # number of predictors at each split
trees = 1000, # number of trees 1000
min_n = tune() # minimum node size
) %>%
set_engine("randomForest") %>%
set_mode("regression")
#3. Setting the recipe and workflow
rf_recipe <- recipe(amt ~ ., data = train_reg_tidy)
rf_workflow <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(rf_recipe)
#4. Tuning the grid
rf_grid <- grid_regular(
mtry(range = c(2, floor(ncol(train_reg_tidy) / 3))), # try between 2 and total predictors/3 predictors
min_n(range = c(2, 10)), # try node sizes between 2 and 10
levels = 5 # number of values per parameter
)
#5. Setting cross-validation folds
set.seed(123)
cv_folds <- vfold_cv(train_reg_tidy, v = 5)
#6. Execute tuning
rf_tuned <- tune_grid(
rf_workflow,
resamples = cv_folds,
grid = rf_grid,
metrics = metric_set(rmse, rsq, mae)
)
#7. Showing the best result
best_rf <- select_best(rf_tuned, metric = "rmse")
best_rf
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 12 6 pre0_mod23_post0
Based on the best fit, the parameters are locked with the values of mtry=12 and min_n=6.
#1. Finalize workflow with best parameters
final_rf_workflow <- finalize_workflow(rf_workflow, best_rf)
#2. Fit on full training set
rf_final_fit <- fit(final_rf_workflow, data = train_reg_tidy)
The performance before and after tuning is tabulated and presented in a comparison chart.
#1. Predict on test set
rf_preds <- predict(rf_final_fit, test_reg_tidy) %>%
bind_cols(test_reg_tidy %>% select(amt))
#2. Display the metrics
metrics <- rf_preds %>%
metrics(truth = amt, estimate = .pred)
# Row order of metrics() is rmse, rsq, mae, matching the 'before' vector
metrics_comp <- cbind(metrics, before = c(rf_reg_rmse, rf_reg_r2, rf_reg_mae)) %>%
  rename(after = .estimate) %>%
  select(-.estimator)
metrics_comp$.metric[metrics_comp$.metric == "rsq"] <- "r2"
metrics_comp
## .metric after before
## 1 rmse 0.4142039 0.4170278
## 2 r2 0.5764168 0.5699191
## 3 mae 0.2253814 0.2275935
#1. Reshape to long format
metrics_long <- metrics_comp %>%
pivot_longer(cols = c(after, before),
names_to = "Stage", values_to = "Value") %>%
mutate(.metric = toupper(.metric)) # convert to uppercase
#2. Plot the comparison chart
ggplot(metrics_long, aes(x = .metric, y = Value, fill = Stage)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(Value, 3)),
position = position_dodge(width = 0.9),
vjust = -0.3, size = 3) +
labs(
title = "Performance Comparison of Model Before and After Hyperparameter Tuning",
x = "Metric",
y = "Value",
fill = "Stage"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key Finding: Hyperparameter tuning yielded only a marginal improvement. The moderate R² (≈0.58), together with the relatively large RMSE and MAE, indicates that the model's ability to predict transaction amount is only moderate.
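One way to see this visually (a sketch using rf_preds from above, not part of the original run) is a predicted-vs-actual plot; points far from the dashed identity line correspond to poorly predicted amounts:
# Sketch: predicted vs actual (standardized) transaction amounts
ggplot(rf_preds, aes(x = amt, y = .pred)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Tuned Random Forest: Predicted vs Actual Amount",
       x = "Actual amount (scaled)", y = "Predicted amount (scaled)") +
  theme_minimal()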
# Get feature importance from the trained RandomForest model
#1. Extract fitted RF model from workflow
rf_model <- extract_fit_parsnip(rf_final_fit)$fit
#2. Get importance scores and convert to tibble
importance_tbl <- importance(rf_model) %>%
as.data.frame() %>%
rownames_to_column("Feature") %>%
arrange(desc(IncNodePurity)) # sort by importance
#3. Show top 20 features
top20_tbl <- importance_tbl %>% slice(1:20)
print(top20_tbl)
## Feature IncNodePurity
## 1 hour_cos 543.40490
## 2 merchant_fraud_rate 520.74034
## 3 hour_sin 477.52030
## 4 age 337.01484
## 5 job_fraud_rate 327.33867
## 6 categoryshopping_net 273.46409
## 7 job_trans_count 264.40805
## 8 distance_km 215.92259
## 9 merchant_trans_count 156.17828
## 10 city_pop 154.88854
## 11 categoryshopping_pos 106.20650
## 12 categorymisc_net 88.91660
## 13 weekday_sin 83.46577
## 14 weekday_cos 81.19119
## 15 categorygrocery_pos 80.93233
## 16 stateNM 36.96821
## 17 categorygas_transport 21.75211
## 18 stateMO 20.40670
## 19 stateNE 20.40366
## 20 stateWA 19.08137
p_imp <- rf_final_fit %>%
extract_fit_engine() %>%
vip(
num_features = 20,
geom = "point",
aesthetics = list(color = "darkred", size = 3)
) +
labs(
title = "Tuned Random Forest - Top 20 Feature Importance",
subtitle = "Based on IncNodePurity (Variance Reduction)",
x = "Features",
y = "Importance Score"
) +
theme_minimal()
print(p_imp)
Key Finding: hour_cos is the top predictor, closely followed by merchant_fraud_rate and hour_sin. This indicates both that the engineered fraud-rate metrics are valuable and that transaction timing is important for estimating transaction amounts. These three predictors score roughly 40-60% higher than the next-ranked feature (age), standing well above the rest.
#1. Define a prediction wrapper function
p_fun <- function(object, newdata) {
predict(object, newdata) %>%
pull(.pred) # extract the numeric prediction column from tidymodels
}
#2. Calculate SHAP values
X_test <- test_reg_tidy %>% select(-amt) # excluding the target variable
shap_values <- explain(
rf_final_fit,
X = as.matrix(X_test),
pred_wrapper = p_fun,
nsim = 50, # Number of Monte Carlo repetitions; higher is more accurate but slower
adjust = TRUE
)
#3. Create a shapviz object for plotting
shp <- shapviz(shap_values, X = as.matrix(X_test))
#4. Plot the Beeswarm Chart
sv_importance(shp, kind = "beeswarm") +
theme_minimal() +
labs(title = "SHAP Feature Importance (Beeswarm Plot)")
Key Finding: merchant_fraud_rate again emerges as the strongest predictor, with higher values consistently driving higher predicted transaction amounts. Temporal features such as hour_cos and hour_sin also rank among the most influential, indicating that time of day shapes transaction behaviour, with certain hours contributing more strongly to larger spending. Shopping categories such as categoryshopping_net (online transactions) and categoryshopping_pos (point-of-sale retail) further show notable predictive impact, particularly in association with higher transaction amounts.
This project successfully developed machine learning solutions for credit card fraud detection and transaction amount prediction, achieving strong classification performance and moderately useful amount estimation.