TEAM MEMBERS - G15

Kevin Noel Andrew - 24087626
Rupika - 24082992
Nicole Grace Yeo Li Ying - 17201831
Yamunah Selvamani - 24065911
Thieveelan - 24200078

INTRODUCTION

In our modern digital economy, fraudulent credit card transactions pose a significant threat to both consumers and financial organisations. As online transactions expand, identifying fraudulent and unauthorised activity has grown more challenging. Machine learning delivers reliable techniques that can enhance fraud detection; however, it is crucial to systematically evaluate the effectiveness of these models to ensure their reliability in real-time scenarios.

This project focuses on three main goals. Firstly, it aims to determine the efficacy of machine learning models in classifying credit card transactions as fraudulent or legitimate, using a variety of performance metrics. Secondly, it seeks to identify the key features associated with fraudulent transactions and study their impact on the performance of detection models. Lastly, it strives to develop a regression model that predicts transaction amounts from features such as merchant, category, consumer demographics, and transaction location, providing insights into spending patterns and helping to detect anomalies. The Credit Card Transactions Fraud Detection Dataset from Kaggle was used in this analysis; it comprises transaction details, consumer demographics, and merchant data.

PROBLEM STATEMENT

Fraudulent credit card transactions continue to pose a substantial concern for both financial companies and consumers. Conventional methods of fraud detection are inadequate in addressing the rising volume and complexity of fraudulent operations, resulting in major financial losses and diminished consumer trust.

The dataset used for credit card fraud detection exhibits class imbalance, with fraudulent transactions constituting only a minor portion of all transactions. This imbalance hinders the ability of machine learning models to detect fraud effectively, making reliable, real-time detection difficult.

This study seeks to tackle these difficulties by applying machine learning methodologies to enhance the ability to detect fraudulent transactions. Furthermore, understanding the patterns associated with transaction amounts can provide significant insights into spending behaviour and contribute to detecting unusual patterns that may indicate fraud. Building precise classification and regression models can assist financial organisations in improving their fraud mitigation strategies and reducing risk.

OBJECTIVES

  1. To determine the efficacy of machine learning models in detecting credit card transactions as fraudulent or legitimate.

  2. To identify the key features associated with fraudulent transactions and study their impact on the performance of detection models.

  3. To develop a regression model that predicts transaction amounts from features such as merchant, category, consumer demographics, and transaction location, providing insights into spending patterns and helping to detect anomalies.

METHODOLOGY

Data Cleaning


Initializing Notebook Requirement

Importing Libraries

library(readr)
## Warning: package 'readr' was built under R version 4.4.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(stringr)
## Warning: package 'stringr' was built under R version 4.4.3

Importing datasets

train_main <- read_csv("fraudTrain.csv", show_col_types = FALSE)  #Training dataset
## New names:
## • `` -> `...1`
test_main <- read_csv("fraudTest.csv", show_col_types = FALSE)   #Test dataset
## New names:
## • `` -> `...1`
#Synthetic dirty data
synthetic <- read_csv("synthetic_dirty_fraud_10k.csv", show_col_types = FALSE)
head(train_main)
## # A tibble: 6 × 23
##    ...1 trans_date_trans_time  cc_num merchant       category    amt first last 
##   <dbl> <dttm>                  <dbl> <chr>          <chr>     <dbl> <chr> <chr>
## 1     0 2019-01-01 00:00:18   2.70e15 fraud_Rippin,… misc_net   4.97 Jenn… Banks
## 2     1 2019-01-01 00:00:44   6.30e11 fraud_Heller,… grocery… 107.   Step… Gill 
## 3     2 2019-01-01 00:00:51   3.89e13 fraud_Lind-Bu… enterta… 220.   Edwa… Sanc…
## 4     3 2019-01-01 00:01:16   3.53e15 fraud_Kutch, … gas_tra…  45    Jere… White
## 5     4 2019-01-01 00:03:06   3.76e14 fraud_Keeling… misc_pos  42.0  Tyler Garc…
## 6     5 2019-01-01 00:04:08   4.77e15 fraud_Stroman… gas_tra…  94.6  Jenn… Conn…
## # ℹ 15 more variables: gender <chr>, street <chr>, city <chr>, state <chr>,
## #   zip <dbl>, lat <dbl>, long <dbl>, city_pop <dbl>, job <chr>, dob <date>,
## #   trans_num <chr>, unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
## #   is_fraud <dbl>
head(test_main)
## # A tibble: 6 × 23
##    ...1 trans_date_trans_time  cc_num merchant category   amt first last  gender
##   <dbl> <dttm>                  <dbl> <chr>    <chr>    <dbl> <chr> <chr> <chr> 
## 1     0 2020-06-21 12:14:25   2.29e15 fraud_K… persona…  2.86 Jeff  Elli… M     
## 2     1 2020-06-21 12:14:33   3.57e15 fraud_S… persona… 29.8  Joan… Will… F     
## 3     2 2020-06-21 12:14:53   3.60e15 fraud_S… health_… 41.3  Ashl… Lopez F     
## 4     3 2020-06-21 12:15:15   3.59e15 fraud_H… misc_pos 60.0  Brian Will… M     
## 5     4 2020-06-21 12:15:17   3.53e15 fraud_J… travel    3.19 Nath… Mass… M     
## 6     5 2020-06-21 12:15:37   3.04e13 fraud_D… kids_pe… 19.6  Dani… Evans F     
## # ℹ 14 more variables: street <chr>, city <chr>, state <chr>, zip <dbl>,
## #   lat <dbl>, long <dbl>, city_pop <dbl>, job <chr>, dob <date>,
## #   trans_num <chr>, unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
## #   is_fraud <dbl>

Merging Synthetic Dirty Data to Raw Data

The synthetic dirty data is merged with the raw training data and shuffled to simulate real-life messy data.

train_main <- rbind(train_main,synthetic)

#Shuffle dataset
set.seed(42)
df_train <- train_main[sample(nrow(train_main)),]
df_train$...1 <- seq_len(nrow(df_train)) #Update index column back to 1
glimpse(df_train)
## Rows: 1,306,675
## Columns: 23
## $ ...1                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trans_date_trans_time <dttm> 2020-04-06 19:57:00, 2019-02-02 08:43:43, 2019-…
## $ cc_num                <chr> "4992346398065154048", "4169388510116", "3409538…
## $ merchant              <chr> "fraud_Kilback LLC", "fraud_Doyle Ltd", "fraud_G…
## $ category              <chr> "food_dining", "grocery_pos", "kids_pets", "heal…
## $ amt                   <dbl> 12.43, 131.96, 42.56, 1.06, 108.33, 67.08, 46.34…
## $ first                 <chr> "Benjamin", "Marcia", "Tyler", "Michelle", "Davi…
## $ last                  <chr> "Kim", "Molina", "Wright", "Russell", "Nichols",…
## $ gender                <chr> "M", "F", "M", "F", "M", "F", "F", "F", "F", "F"…
## $ street                <chr> "920 Patrick Light", "6744 Jimmy Extensions", "6…
## $ city                  <chr> "Mc Nabb", "Port Gibson", "Warren", "Hawley", "P…
## $ state                 <chr> "IL", "NY", "MI", "MN", "CA", "WV", "TN", "HI", …
## $ zip                   <chr> "61335", "14537", "48088", "56549", "93552", "25…
## $ lat                   <dbl> 41.17300, 43.03300, 42.51640, 46.97770, 34.57150…
## $ long                  <dbl> -89.218700, -77.157500, -82.983200, -96.409200, …
## $ city_pop              <dbl> 532, 207, 134056, 4508, 171170, 1925, 151785, 14…
## $ job                   <chr> "Audiological scientist", "Database administrato…
## $ dob                   <date> 1956-01-09, 1962-09-27, 1980-05-18, 1949-04-24,…
## $ trans_num             <chr> "94f1aea5971686b061d864ad75993476", "cb9af006ec6…
## $ unix_time             <dbl> 1365278220, 1328172223, 1348318610, 1365800290, …
## $ merch_lat             <dbl> 40.380771, 42.567345, 41.948677, 47.519601, 35.0…
## $ merch_long            <dbl> -88.36074, -77.81022, -83.31826, -95.65286, -117…
## $ is_fraud              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Data Cleaning & Understanding

Data Understanding

Calculate the percentage of each target class to check whether the data is imbalanced.

percent <- df_train %>% 
  count(is_fraud) %>% 
  mutate(percentage = round(n / sum(n) * 100, 2))
Title: Credit Card Transactions Fraud Detection Dataset
Year: 01-01-2019 to 31-12-2020
Author: Kartik Shenoy
Purpose of Dataset: Data generated by a simulator to train machine learning models for credit card fraud detection
Dimension: Rows: 1,306,675; Columns: 23
Content: Legit: 99.42%; Fraud: 0.58%
Summary: Imbalanced dataset with a variety of features to be leveraged

Renaming Columns

df_train <- df_train %>% 
  rename(
    "no" = "...1",
    "transaction_datetime" = "trans_date_trans_time",
    "transaction_amount(USD)" = "amt",
    "first_name" = "first",
    "last_name" = "last",
  )

test_main <- test_main %>% 
  rename(
    "no" = "...1",
    "transaction_datetime" = "trans_date_trans_time",
    "transaction_amount(USD)" = "amt",
    "first_name" = "first",
    "last_name" = "last",
  )

df_train$merchant <- sub("^fraud_", "", df_train$merchant)
test_main$merchant <- sub("^fraud_", "", test_main$merchant)
colnames(df_train)
##  [1] "no"                      "transaction_datetime"   
##  [3] "cc_num"                  "merchant"               
##  [5] "category"                "transaction_amount(USD)"
##  [7] "first_name"              "last_name"              
##  [9] "gender"                  "street"                 
## [11] "city"                    "state"                  
## [13] "zip"                     "lat"                    
## [15] "long"                    "city_pop"               
## [17] "job"                     "dob"                    
## [19] "trans_num"               "unix_time"              
## [21] "merch_lat"               "merch_long"             
## [23] "is_fraud"

Data Types conversion

Many of the columns are stored as doubles, which require more computational resources than necessary. We will convert these columns to integer and convert is_fraud to a factor.

df_train <- df_train %>% 
  mutate(
    no = as.integer(no),
    zip = as.integer(zip),
    city_pop = as.integer(city_pop),
    is_fraud = as.factor(is_fraud)
  )
glimpse(df_train)
## Rows: 1,306,675
## Columns: 23
## $ no                        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ transaction_datetime      <dttm> 2020-04-06 19:57:00, 2019-02-02 08:43:43, 2…
## $ cc_num                    <chr> "4992346398065154048", "4169388510116", "340…
## $ merchant                  <chr> "Kilback LLC", "Doyle Ltd", "Gleason and Son…
## $ category                  <chr> "food_dining", "grocery_pos", "kids_pets", "…
## $ `transaction_amount(USD)` <dbl> 12.43, 131.96, 42.56, 1.06, 108.33, 67.08, 4…
## $ first_name                <chr> "Benjamin", "Marcia", "Tyler", "Michelle", "…
## $ last_name                 <chr> "Kim", "Molina", "Wright", "Russell", "Nicho…
## $ gender                    <chr> "M", "F", "M", "F", "M", "F", "F", "F", "F",…
## $ street                    <chr> "920 Patrick Light", "6744 Jimmy Extensions"…
## $ city                      <chr> "Mc Nabb", "Port Gibson", "Warren", "Hawley"…
## $ state                     <chr> "IL", "NY", "MI", "MN", "CA", "WV", "TN", "H…
## $ zip                       <int> 61335, 14537, 48088, 56549, 93552, 25442, 37…
## $ lat                       <dbl> 41.17300, 43.03300, 42.51640, 46.97770, 34.5…
## $ long                      <dbl> -89.218700, -77.157500, -82.983200, -96.4092…
## $ city_pop                  <int> 532, 207, 134056, 4508, 171170, 1925, 151785…
## $ job                       <chr> "Audiological scientist", "Database administ…
## $ dob                       <date> 1956-01-09, 1962-09-27, 1980-05-18, 1949-04…
## $ trans_num                 <chr> "94f1aea5971686b061d864ad75993476", "cb9af00…
## $ unix_time                 <dbl> 1365278220, 1328172223, 1348318610, 13658002…
## $ merch_lat                 <dbl> 40.380771, 42.567345, 41.948677, 47.519601, …
## $ merch_long                <dbl> -88.36074, -77.81022, -83.31826, -95.65286, …
## $ is_fraud                  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Missing values, duplicates and class imbalance

Checking missing values

There are more than 900 missing values in a few columns. We group by is_fraud to check whether dropping the affected rows is feasible; since only a small number of fraud transactions have missing values, all rows containing missing values will be dropped.

df_train %>% 
  group_by(is_fraud) %>% 
  summarise(across(everything(), ~ sum(is.na(.))))
## # A tibble: 2 × 23
##   is_fraud    no transaction_datetime cc_num merchant category
##   <fct>    <int>                <int>  <int>    <int>    <int>
## 1 0            0                    0      0      952      996
## 2 1            0                    0      0       10        4
## # ℹ 17 more variables: `transaction_amount(USD)` <int>, first_name <int>,
## #   last_name <int>, gender <int>, street <int>, city <int>, state <int>,
## #   zip <int>, lat <int>, long <int>, city_pop <int>, job <int>, dob <int>,
## #   trans_num <int>, unix_time <int>, merch_lat <int>, merch_long <int>
df_train <- df_train %>% drop_na()

#Validate columns if dropped
df_train %>% summarise(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 23
##      no transaction_datetime cc_num merchant category `transaction_amount(USD)`
##   <int>                <int>  <int>    <int>    <int>                     <int>
## 1     0                    0      0        0        0                         0
## # ℹ 17 more variables: first_name <int>, last_name <int>, gender <int>,
## #   street <int>, city <int>, state <int>, zip <int>, lat <int>, long <int>,
## #   city_pop <int>, job <int>, dob <int>, trans_num <int>, unix_time <int>,
## #   merch_lat <int>, merch_long <int>, is_fraud <int>

Removing duplicates

df_train <- df_train %>% distinct()

dim_before <- dim(train_main) #Dimension for main dataset (no cleaning done)
dim_after <- dim(df_train)    #Dimension after duplicate removes

if (identical(dim_before,dim_after)){
  print("No duplicates found")
}else{
  print("Duplicates found and removed")
}
## [1] "Duplicates found and removed"
sprintf("Dimension before removing duplicates: %s", paste(dim_before , collapse = " x "))
## [1] "Dimension before removing duplicates: 1306675 x 23"
sprintf("Dimension after removing duplicates: %s", paste(dim_after, collapse = " x "))
## [1] "Dimension after removing duplicates: 1300491 x 23"

Checking class imbalance

df_train %>% 
  count(is_fraud) %>% 
  mutate(percentage = n / sum(n) * 100) %>% 
  ggplot(aes(x = as.factor(is_fraud), y = percentage, fill = is_fraud)) +
  geom_col() +
  geom_text(aes(label = paste0(round(percentage, 2), "%")), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("0" = "#A8E6CF", "1" = "#FF8B94")) +
  labs(
    title = "Class Distribution: Fraud vs Legit Transaction",
    x = "Transaction Type (0 = Legit, 1 = Fraud)",
    y = "Percentage",
    fill = "Transaction Type"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

The dataset contains only 0.58% fraud transactions, which is severely imbalanced. To address this, we will either use SMOTE to oversample the minority class by generating synthetic records, or evaluate models with the AUPRC metric.
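For illustration, a minimal sketch of how SMOTE oversampling could be applied to this dataset is shown below. It assumes the smotefamily package (not used elsewhere in this report) and restricts itself to a few numeric predictors; the actual modelling later in this report uses undersampling instead.

# Minimal SMOTE sketch (assumes the smotefamily package; illustrative only)
library(smotefamily)

set.seed(42)
# SMOTE interpolates between minority-class neighbours, so numeric predictors only
X <- df_train %>%
  select(`transaction_amount(USD)`, city_pop, lat, long) %>%
  mutate(across(everything(), as.numeric))
y <- df_train$is_fraud

smote_out <- SMOTE(X, target = y, K = 5)   # dup_size left at its default
df_balanced <- smote_out$data              # predictors plus a "class" column
table(df_balanced$class)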

Data Manipulation

colnames(df_train)
##  [1] "no"                      "transaction_datetime"   
##  [3] "cc_num"                  "merchant"               
##  [5] "category"                "transaction_amount(USD)"
##  [7] "first_name"              "last_name"              
##  [9] "gender"                  "street"                 
## [11] "city"                    "state"                  
## [13] "zip"                     "lat"                    
## [15] "long"                    "city_pop"               
## [17] "job"                     "dob"                    
## [19] "trans_num"               "unix_time"              
## [21] "merch_lat"               "merch_long"             
## [23] "is_fraud"

Standardizing case and whitespaces

#Title case selected columns
df_train <- df_train %>% 
  mutate(across(c('first_name','last_name','gender','street','city'), ~ str_to_title(str_trim(.))))

#Upper case all merchant name due to ambiguity in naming and abbreviations
df_train <- df_train %>% 
  mutate(merchant = toupper(str_trim(merchant)))

Dropping unnecessary rows and columns

unique(df_train$is_fraud)
## [1] 0 1
## Levels: 0 1
#Filter rows where first or last name contains non-alphabetic characters (legit transactions only)
df_train %>% 
  filter((str_detect(first_name, "[^A-Za-z]") | str_detect(last_name, "[^A-Za-z]")) & is_fraud == 0)
## # A tibble: 782 × 23
##       no transaction_datetime cc_num    merchant category transaction_amount(U…¹
##    <int> <dttm>               <chr>     <chr>    <chr>                     <dbl>
##  1  4395 2021-03-12 05:10:14  42820619… HOWARD,… health_…                   61.6
##  2  4987 2023-02-14 03:56:50  34014471… CASTILL… gas_tra…                   16.2
##  3  5723 2023-08-30 18:43:06  37345522… OBRIEN … enterta…                   25  
##  4  6489 2021-12-30 01:03:11  30553184… CROSBY,… shoppin…                  155. 
##  5 11728 2022-12-23 19:56:35  46095077… DOUGHER… health_…                  118. 
##  6 13461 2023-03-25 12:31:39  27101503… JAMES, … home                       21.9
##  7 14773 2021-12-10 16:45:41  48005320… LAMB-GO… travel                     17.5
##  8 16716 2020-03-26 17:18:04  44715514… KELLY L… misc_pos                  171. 
##  9 19908 2021-05-27 00:03:00  30384912… ENWSSVO… misc_pos                   23.4
## 10 20311 2021-08-18 07:55:07  42359668… LOWE-KH… misc_net                   81.1
## # ℹ 772 more rows
## # ℹ abbreviated name: ¹​`transaction_amount(USD)`
## # ℹ 17 more variables: first_name <chr>, last_name <chr>, gender <chr>,
## #   street <chr>, city <chr>, state <chr>, zip <int>, lat <dbl>, long <dbl>,
## #   city_pop <int>, job <chr>, dob <date>, trans_num <chr>, unix_time <dbl>,
## #   merch_lat <dbl>, merch_long <dbl>, is_fraud <fct>
#Dropping legitimate transactions whose first or last name contains non-alphabetic characters
df_train <- df_train %>% 
  filter(!((str_detect(first_name, "[^A-Za-z]") |
            str_detect(last_name, "[^A-Za-z]")) & is_fraud == 0))
#Drop unix time columns, redundant with transaction_datetime column
df_train %>% select(-unix_time)
## # A tibble: 0 × 22
## # ℹ 22 variables: no <int>, transaction_datetime <dttm>, cc_num <chr>,
## #   merchant <chr>, category <chr>, transaction_amount(USD) <dbl>,
## #   first_name <chr>, last_name <chr>, gender <chr>, street <chr>, city <chr>,
## #   state <chr>, zip <int>, lat <dbl>, long <dbl>, city_pop <int>, job <chr>,
## #   dob <date>, trans_num <chr>, merch_lat <dbl>, merch_long <dbl>,
## #   is_fraud <fct>

Handling unexpected values in the gender column

The gender column contains unexpected values other than M (male) or F (female).

unique(df_train$gender)
## character(0)

Before removing any such rows, we impute gender from the first name.

df_train %>% filter((gender == "Unknown") | (gender == "U"))
## # A tibble: 0 × 23
## # ℹ 23 variables: no <int>, transaction_datetime <dttm>, cc_num <chr>,
## #   merchant <chr>, category <chr>, transaction_amount(USD) <dbl>,
## #   first_name <chr>, last_name <chr>, gender <chr>, street <chr>, city <chr>,
## #   state <chr>, zip <int>, lat <dbl>, long <dbl>, city_pop <int>, job <chr>,
## #   dob <date>, trans_num <chr>, unix_time <dbl>, merch_lat <dbl>,
## #   merch_long <dbl>, is_fraud <fct>
library("babynames")
## Warning: package 'babynames' was built under R version 4.4.3
name_gender_map <- babynames %>% 
  group_by(name,sex) %>% 
  summarise(n = sum(n), .groups = "drop") %>% 
  group_by(name) %>% 
  slice_max(n, n = 1) %>% 
  ungroup()

df_train <- df_train %>% 
  select(-gender) %>% 
  left_join(name_gender_map, by = c("first_name" = "name")) %>% 
  rename(gender=sex)
#Validate changes
df_train %>% filter((gender == "Unknown") | (gender == "U"))
## # A tibble: 0 × 24
## # ℹ 24 variables: no <int>, transaction_datetime <dttm>, cc_num <chr>,
## #   merchant <chr>, category <chr>, transaction_amount(USD) <dbl>,
## #   first_name <chr>, last_name <chr>, street <chr>, city <chr>, state <chr>,
## #   zip <int>, lat <dbl>, long <dbl>, city_pop <int>, job <chr>, dob <date>,
## #   trans_num <chr>, unix_time <dbl>, merch_lat <dbl>, merch_long <dbl>,
## #   is_fraud <fct>, gender <chr>, n <int>
#First names that our name corpus cannot map to a gender are dropped (legitimate transactions only), as they do not carry significant value.
df_train <- df_train %>% 
  filter(!(is.na(gender) & is_fraud == 0))

Dropping gibberish street names

#Separating street number and street name to filter out gibberish street naming
df_street_manipulate <- df_train %>%
  mutate(
          street_number = str_extract(street, "^\\d+"),
          street_temp = str_remove(street, "^\\d+\\s*"),
          street_name = str_remove(street_temp, "\\s*\\d+$"),
          street_back_number = str_extract(street_temp,"\\d+$")
        ) %>% 
  select(-c(street,street_temp))

#Replace missing trailing street numbers with empty strings before recombining
df_street_manipulate <- df_street_manipulate %>%
  mutate(street_back_number = ifelse(is.na(street_back_number), "",street_back_number))

#Display gibberish name
df_street_manipulate %>%
  filter(
    ( !str_detect(street_name, "[^A-Za-z]") & 
       !str_detect(street_name, "\\bApt\\b") & 
       is_fraud == 0 )
    ) %>% select(street_name)
## # A tibble: 0 × 1
## # ℹ 1 variable: street_name <chr>
#Fallback code
#test <- df_street_manipulate
#table(df_street_manipulate$is_fraud)
#Dropping gibberish street naming
df_train <- df_street_manipulate %>% 
  filter(!(str_detect(street_name, "[^A-Za-z]") & str_detect(street_name, "\\bApt\\b") & is_fraud == 0)) %>% 
  unite(street, c(street_number, street_name, street_back_number), sep = " ", remove = TRUE) %>%
  mutate(street = str_squish(street))  # remove extra spaces

table(df_train$is_fraud)
## 
## 0 1 
## 0 0
#reorder columns 
reorder_columns <- c("no","transaction_datetime","cc_num","merchant","category","transaction_amount(USD)",
                     "first_name","last_name","gender","street","city","state","zip","lat","long","city_pop",
                     "job","dob","trans_num","unix_time","merch_lat","merch_long","is_fraud")
df_train <- df_train[,reorder_columns]

head(df_train)
## # A tibble: 0 × 23
## # ℹ 23 variables: no <int>, transaction_datetime <dttm>, cc_num <chr>,
## #   merchant <chr>, category <chr>, transaction_amount(USD) <dbl>,
## #   first_name <chr>, last_name <chr>, gender <chr>, street <chr>, city <chr>,
## #   state <chr>, zip <int>, lat <dbl>, long <dbl>, city_pop <int>, job <chr>,
## #   dob <date>, trans_num <chr>, unix_time <dbl>, merch_lat <dbl>,
## #   merch_long <dbl>, is_fraud <fct>

Dropping unrelated city names

#Extract valid city names from the maps package
library(maps)
## Warning: package 'maps' was built under R version 4.4.3
valid_cities <- maps::us.cities$name
valid_cities_cleaned <- str_remove(valid_cities, "\\s+[A-Z]{2}$") %>% unique()
#Flag city that is gibberish
df_train <- df_train %>% 
  mutate(city_clean = str_to_lower(city)) %>%
  mutate(city_is_gibberish = !str_detect(city_clean, "^[a-z\\s\\-\\.]+$") |
                              str_detect(city_clean, "[aeiou]{3,}") |     # unlikely vowel triplets
                              str_detect(city_clean, ".*(.)\\1{2,}.*"))    # repeated characters

df_train %>% filter(city_is_gibberish) %>% count(city_clean, sort = TRUE)
## # A tibble: 0 × 2
## # ℹ 2 variables: city_clean <chr>, n <int>
df_train <- df_train %>%
  filter(!city_is_gibberish) %>%
  select(-c('city_is_gibberish','city_clean'))

Overview of cleaned data

Data Cleaning & Manipulation Summary

Step | Description | Location / Note
Import Libraries | Loaded readr, dplyr, tidyr, ggplot2, stringr. | Setup phase
Load Datasets | Imported fraudTrain.csv, fraudTest.csv, and synthetic_dirty_fraud_10k.csv. | Data import section
Rename Columns | Renamed columns like amt → transaction_amount(USD), first → first_name. | Early transformation
Standardize Case | Used str_trim() + str_to_title() for name, merchant, street, city fields. | For consistency across character fields
Remove Prefix in Merchant | Removed "fraud_" prefix from merchant names. | After import
Uppercase Merchant Names | Converted merchant names to uppercase. | Ensured consistency
Filter by City Name | Dropped rows where city names are not found in maps::us.cities. | Matched against US city corpus
Remove Invalid Names | Filtered out non-alphabetic first_name / last_name (only in non-fraud rows). | Used regex filtering
Drop Redundant Column | Dropped unix_time which duplicated timestamp info. | Confirmed overlap with transaction_datetime
Handle Missing Values | Used drop_na() to remove any rows with missing values. | Confirmed before and after row counts
Remove Duplicates | Removed exact duplicates using distinct(). | Verified row reduction
Data Type Conversion | Converted zip, city_pop, no to integer; is_fraud to factor. | For memory optimization
Gender Reassignment | Mapped first_name to most common gender using the babynames dataset. | Removed old gender column, joined in the imputed value
Column Rearrangement | Moved gender column after last_name; removed helper column n. | Final dataset formatting
write.csv(df_train, file="df_train_cleaned.csv")
write.csv(test_main, file="df_test_cleaned.csv")

Exploratory Data Analysis

Data Visualization

In this section, the team transforms the data into graphs and diagrams to provide a visual approach for understanding the dataset’s characteristics and patterns prior to the analysis. Basic summary statistics, including the mean, median, standard deviation, minimum, and maximum, were computed for the numerical variables to assess central tendency and variability.

# Additional libraries used in this section
library(skimr)      # skim()
library(geosphere)  # distHaversine()
library(lubridate)  # interval(), years(), today()
library(patchwork)  # arranging ggplots with | and /
library(corrplot)   # correlation plot
df <- read.csv("C:/Users/User/Downloads/df_train_cleaned (1).csv")
# Fix character encoding issues
df <- df %>%
  mutate(across(where(is.character), ~ iconv(., from = "", to = "UTF-8", sub = "byte")))

glimpse(df)
## Rows: 965,970
## Columns: 24
## $ X                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ no                      <int> 1, 2, 4, 5, 6, 7, 10, 11, 14, 15, 16, 17, 19, …
## $ transaction_datetime    <chr> "2020-04-06 19:57:00", "2019-02-02 08:43:43", …
## $ cc_num                  <dbl> 4.992346e+18, 4.169389e+12, 4.384910e+15, 3.53…
## $ merchant                <chr> "KILBACK LLC", "DOYLE LTD", "KLOCKO, RUNOLFSDO…
## $ category                <chr> "food_dining", "grocery_pos", "health_fitness"…
## $ transaction_amount.USD. <dbl> 12.43, 131.96, 1.06, 108.33, 67.08, 46.34, 32.…
## $ first_name              <chr> "Benjamin", "Marcia", "Michelle", "David", "Th…
## $ last_name               <chr> "Kim", "Molina", "Russell", "Nichols", "Blackw…
## $ gender                  <chr> "M", "F", "F", "M", "F", "F", "F", "F", "F", "…
## $ street                  <chr> "920 Patrick Light", "6744 Jimmy Extensions", …
## $ city                    <chr> "Mc Nabb", "Port Gibson", "Hawley", "Palmdale"…
## $ state                   <chr> "IL", "NY", "MN", "CA", "WV", "TN", "CO", "AR"…
## $ zip                     <int> 61335, 14537, 56549, 93552, 25442, 37040, 8164…
## $ lat                     <dbl> 41.1730, 43.0330, 46.9777, 34.5715, 39.3716, 3…
## $ long                    <dbl> -89.2187, -77.1575, -96.4092, -118.0231, -77.8…
## $ city_pop                <int> 532, 207, 4508, 171170, 1925, 151785, 61, 5161…
## $ job                     <chr> "Audiological scientist", "Database administra…
## $ dob                     <chr> "1956-01-09", "1962-09-27", "1949-04-24", "196…
## $ trans_num               <chr> "94f1aea5971686b061d864ad75993476", "cb9af006e…
## $ unix_time               <dbl> 1365278220, 1328172223, 1365800290, 1364693326…
## $ merch_lat               <dbl> 40.38077, 42.56735, 47.51960, 35.02748, 39.358…
## $ merch_long              <dbl> -88.36074, -77.81022, -95.65286, -117.71576, -…
## $ is_fraud                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
skim(df)
Data summary
Name df
Number of rows 965970
Number of columns 24
_______________________
Column type frequency:
character 12
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
transaction_datetime 0 1 10 19 0 953820 0
merchant 0 1 7 37 0 2043 0
category 0 1 4 14 0 15 0
first_name 0 1 2 11 0 514 0
last_name 0 1 2 13 0 839 0
gender 0 1 1 1 0 2 0
street 0 1 12 38 0 2310 0
city 0 1 3 25 0 2180 0
state 0 1 2 2 0 51 0
job 0 1 3 59 0 1569 0
dob 0 1 10 10 0 2221 0
trans_num 0 1 32 36 0 965970 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1 4.829855e+05 2.788517e+05 1.000000e+00 2.414932e+05 4.829855e+05 7.244777e+05 9.659700e+05 ▇▇▇▇▇
no 0 1 6.532207e+05 3.772152e+05 1.000000e+00 3.263863e+05 6.532625e+05 9.798387e+05 1.306675e+06 ▇▇▇▇▇
cc_num 0 1 4.308004e+17 1.328329e+18 6.041292e+10 1.800655e+14 3.518235e+15 4.607073e+15 4.998590e+18 ▇▁▁▁▁
transaction_amount.USD. 0 1 7.130000e+01 1.638500e+02 3.000000e-02 9.690000e+00 4.782000e+01 8.347000e+01 2.894890e+04 ▇▁▁▁▁
zip 0 1 4.805588e+04 2.700881e+04 6.000000e+00 2.510600e+04 4.747000e+04 7.123200e+04 9.994100e+04 ▇▇▇▇▅
lat 0 1 3.867000e+01 5.570000e+00 -8.999000e+01 3.502000e+01 3.957000e+01 4.210000e+01 8.996000e+01 ▁▁▁▇▁
long 0 1 -8.967000e+01 1.453000e+01 -1.798700e+02 -9.648000e+01 -8.697000e+01 -7.999000e+01 1.799700e+02 ▁▇▁▁▁
city_pop 0 1 9.183142e+04 3.163016e+05 2.300000e+01 7.370000e+02 2.566000e+03 2.047800e+04 2.906700e+06 ▇▁▁▁▁
unix_time 0 1 1.349698e+09 1.760215e+07 1.325376e+09 1.338762e+09 1.349302e+09 1.359486e+09 1.748368e+09 ▇▁▁▁▁
merch_lat 0 1 3.867000e+01 5.630000e+00 -8.988000e+01 3.503000e+01 3.956000e+01 4.208000e+01 8.921000e+01 ▁▁▁▇▁
merch_long 0 1 -8.967000e+01 1.454000e+01 -1.798000e+02 -9.636000e+01 -8.715000e+01 -8.000000e+01 1.798500e+02 ▁▇▁▁▁
is_fraud 0 1 1.000000e-02 9.000000e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 ▇▁▁▁▁

Univariate Analysis

Univariate analysis was conducted to explore the distribution and characteristics of the dataset. First, the target variable, is_fraud, was examined using frequency counts and proportions to identify class imbalance. Histograms were then used to visualise the distributions of the continuous variables, detect skewness, and identify potential outliers.

table(df$is_fraud)
## 
##      0      1 
## 958527   7443
prop.table(table(df$is_fraud))
## 
##           0           1 
## 0.992294792 0.007705208
# Calculate distance between customer and merchant locations (km)
df <- df %>%
  mutate(distance_km = distHaversine(cbind(long, lat), cbind(merch_long, merch_lat)) / 1000)
# Convert DOB to Date and calculate age
df$dob <- as.Date(df$dob)
df <- df %>%
  mutate(age = as.integer(interval(start = dob, end = today()) / years(1)))
# Histograms of various features
p1 <- ggplot(df, aes(x = log(transaction_amount.USD. + 1))) +
  geom_histogram(fill = "lightblue", bins = 30, color = "black") +
  labs(title = "Histogram of Log(Transaction Amount + 1)", x = "Log(Transaction Amount + 1)", y = "Frequency") +
  theme_minimal()

p2 <- ggplot(df, aes(x = log(distance_km + 1))) +
  geom_histogram(fill = "lightgreen", bins = 30, color = "black") +
  labs(title = "Histogram of Log(Distance + 1) (km)", x = "Log(Distance + 1)", y = "Frequency") +
  theme_minimal()

p3 <- ggplot(df, aes(x = age)) +
  geom_histogram(fill = "lightpink", bins = 30, color = "black") +
  labs(title = "Histogram of Customer Age", x = "Age (years)", y = "Frequency") +
  theme_minimal()

p4 <- ggplot(df, aes(x = log1p(city_pop))) +
  geom_histogram(fill = "lightyellow", bins = 30, color = "black") +
  labs(title = "Histogram of Log(City Population + 1)", x = "Log(City Population + 1)", y = "Frequency") +
  theme_minimal()

(p1 | p2) / (p3 | p4)

As part of the exploratory data analysis, outlier detection was conducted on the transaction amount variable using the interquartile range (IQR) method, based on the first quartile (Q1), the third quartile (Q3), and the IQR.

Instead of removing these outliers, we flagged them by creating a new binary column indicating their presence. This approach preserves potentially valuable information, especially in fraud detection, where unusual transaction behaviour may itself be indicative of fraud. By flagging rather than filtering, we retained the complete dataset and allowed the machine learning models to use the outlier flag as a possible predictor of fraud.

# Detect outliers in transaction amount using IQR
Q1 <- quantile(df$transaction_amount.USD., 0.25, na.rm = TRUE)
Q3 <- quantile(df$transaction_amount.USD., 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

df <- df %>%
  mutate(outlier_flag = ifelse(transaction_amount.USD. < lower_bound | transaction_amount.USD. > upper_bound, "Outlier", "Normal"))

table(df$outlier_flag)
## 
##  Normal Outlier 
##  915036   50934

The interquartile range (IQR) method was employed to detect outliers in the transaction amount data. The first quartile (Q1) was approximately $9.69, and the third quartile (Q3) was around $83.47, resulting in an IQR of $73.78. Using this, the lower bound was calculated as -$100.98 and the upper bound as $194.14. Since transaction amounts cannot be negative, the threshold for identifying outliers was set at values exceeding $194.14. Transactions above this value are considered outliers and may represent unusually large purchases or potential instances of fraud. Identifying and analyzing these high-value transactions is important for understanding customer spending behavior and for implementing fraud detection strategies.
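The fences quoted above can be reproduced directly from the reported quartiles; the short check below simply re-runs that arithmetic using the rounded values from the text.

# Recomputing the IQR fences from the rounded quartiles quoted above
Q1 <- 9.69
Q3 <- 83.47
IQR_amt <- Q3 - Q1                  # 73.78
lower_bound <- Q1 - 1.5 * IQR_amt   # -100.98
upper_bound <- Q3 + 1.5 * IQR_amt   # 194.14
c(IQR = IQR_amt, lower = lower_bound, upper = upper_bound)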

For continuous variables such as transaction amount and city population, log transformations were applied to normalise the distributions. This transformation reduced skewness and improved distributional comparability. The log-transformed variables were then visualised using density plots, stratified by fraud status, to detect distinguishing patterns between fraudulent and non-fraudulent transactions.

Insights

The univariate analysis of the dataset provided valuable insights into the distribution and characteristics of the variables. The dataset contains 965,970 transactions with 24 variables (12 numeric and 12 character), with no missing values. Key character variables include transaction details (e.g., transaction_datetime, merchant, category), personal data (first_name, last_name, gender, dob, job), and location information (street, city, state). The target variable is_fraud is highly imbalanced, with only ~0.77% fraudulent transactions. Numeric features like transaction_amount.USD. are right-skewed with potential outliers, while city_pop is heavily skewed, reflecting a mix of small towns and large cities. Time-related features such as transaction_datetime and unix_time enable extraction of temporal patterns, and coordinates (lat, long) allow calculation of transaction distance. High-cardinality variables like merchant, job, and trans_num may need encoding or grouping for modeling.

The log-transformed transaction amount displayed a roughly normal distribution, skewed slightly to the right, indicating that most transactions were of small to moderate value, while a few high-value transactions may signify potential outliers or fraudulent activity. The distribution of the log-transformed distance between customer and merchant was highly right-skewed, showing that most transactions occurred over short distances, with a notable peak around log(5); a small number of long-distance transactions may warrant further investigation as possible anomalies. Furthermore, customer ages exhibited a bimodal distribution, with transaction activity concentrated among individuals aged 25 to 65 years, peaking around 30–40 and again at 50–60 years, which may reflect typical user demographics; very young or very old customers were underrepresented and could be further scrutinised. The log-transformed city population was moderately right-skewed, showing that most transactions originated from moderately to densely populated cities.

Overall, this univariate analysis helped uncover the central tendencies, variability, and potential outliers in the data, offering foundational insights for understanding customer behavior and informing subsequent stages of fraud detection analysis.

# Prepare data for density plots
df_long <- df %>%
  select(is_fraud, transaction_amount.USD., city_pop) %>%
  mutate(log_amount = log1p(transaction_amount.USD.),
         log_city_pop = log1p(city_pop)) %>%
  select(is_fraud, log_amount, log_city_pop) %>%
  pivot_longer(cols = c(log_amount, log_city_pop), names_to = "variable", values_to = "value")

# Plot density by fraud status
ggplot(df_long, aes(x = value, fill = factor(is_fraud))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  scale_fill_manual(values = c("0" = "salmon", "1" = "turquoise"), labels = c("No", "Yes")) +
  labs(title = "Density Plot of Log-Transformed Transaction Amount and City Population by Fraud Status",
       x = "Log-Transformed Value", y = "Density", fill = "Is Fraud") +
  theme_minimal()

Bivariate Analysis

Bivariate analysis was carried out to examine the relationships between the predictor variables and the target variable, fraudulent transactions. Fraud rates across categorical features such as transaction category and time of day were visualised using bar charts to identify high-risk categories.

# Fraud rate by category
df %>%
  group_by(category) %>%
  summarise(fraud_rate = mean(is_fraud, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(category, fraud_rate), y = fraud_rate)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(title = "Fraud Rate by Category", y = "Fraud Rate")

Furthermore, to explore how geographic distance influences fraudulent transactions, the distance between the customer’s and merchant’s locations was calculated in kilometres using the Haversine formula and grouped into distance buckets. The fraud rate within each bucket was then visualised using a bar chart to detect distance-based fraud patterns.

# Calculate distance between customer and merchant locations (km)
df <- df %>%
  mutate(distance_km = distHaversine(cbind(long, lat), cbind(merch_long, merch_lat)) / 1000)

# Fraud rate by distance buckets
df %>%
  mutate(distance_bucket = cut(distance_km, breaks = c(0, 10, 50, 100, 500, 1000, Inf))) %>%
  group_by(distance_bucket) %>%
  summarise(fraud_rate = mean(is_fraud)) %>%
  ggplot(aes(x = distance_bucket, y = fraud_rate)) +
  geom_col(fill = "darkred") +
  labs(title = "Fraud Rate by Distance", x = "Distance Bucket (km)", y = "Fraud Rate")
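
As a cross-check on the Haversine formula mentioned above, the sketch below computes the distance manually for the first row of df and compares it with geosphere::distHaversine; the 6,378,137 m radius matches geosphere's default, and the helper function haversine_km is introduced here purely for illustration.

# Manual Haversine distance for one transaction, compared with geosphere
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * atan2(sqrt(a), sqrt(1 - a)) / 1000   # metres -> kilometres
}

with(df[1, ], haversine_km(long, lat, merch_long, merch_lat))
with(df[1, ], distHaversine(cbind(long, lat), cbind(merch_long, merch_lat)) / 1000)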

Insights

The analysis of fraud rates by transaction category reveals that online shopping categories such as shopping_net and misc_net exhibit the highest fraud rates, suggesting that digital platforms are more vulnerable to fraudulent activity. On the other hand, routine spending categories such as personal_care, kids_pets, and health_fitness report the lowest fraud rates, indicating they are less attractive to fraudsters. These findings suggest that fraudsters primarily target online and less regulated transaction types over everyday, in-person purchases.

Furthermore, the analysis of fraud rate by transaction distance reveals distinct patterns that can inform fraud detection strategies. Fraud rates are lowest for shorter distances, particularly in the (0,10], (10,50], and (50,100] kilometre ranges, likely reflecting local or nearby transactions that are typically more legitimate. However, the fraud rate spikes and is highest in the (500,1000] range, indicating that medium-to-long-distance transactions are more suspicious and potentially fraudulent. Interestingly, for very long distances (over 1000 km), the fraud rate declines, which may be due to the smaller number of such transactions in that range.

The analysis of transaction amount and city population in relation to fraud revealed that fraudulent transactions are more commonly associated with higher log-transformed transaction amounts, suggesting that fraudsters tend to target high-value transactions. In contrast, non-fraudulent transactions are more evenly distributed across smaller to medium transaction amounts. Additionally, when analyzing city population, both fraudulent and non-fraudulent transactions are concentrated in moderately to highly populated cities (log values between 7 and 9). This shows that fraudulent transactions occur slightly more frequently in highly populated areas, though this pattern is not as strong as that observed for transaction amounts. Overall, transaction amount emerges as a more influential factor in fraud detection compared to city population, making it a key feature for building predictive models.

Multivariate analysis

To study the joint relationships among multiple features and detect patterns indicative of fraudulent behavior, several multivariate analysis techniques were applied. These include correlation analysis and logistic regression modeling.

Correlation analysis was performed by computing a Pearson correlation matrix to evaluate the linear relationships between transaction amount, city population, customer age, distance in kilometres, and the fraud flag. This analysis helps assess linear relationships between variables and identify multicollinearity, which could influence model performance.

# Correlation matrix of numeric variables + fraud flag
num_vars <- df %>%
  select(transaction_amount.USD., city_pop, age, distance_km, is_fraud) %>%
  na.omit()

corrplot(cor(num_vars), method = "color", type = "upper",
         tl.col = "black", addCoef.col = "black",
         col = colorRampPalette(c("red", "white", "blue"))(200))

Additionally, logistic regression analysis was conducted to obtain explanatory insights early in the analysis.

# Convert dob to Date and calculate age in years
df <- df %>%
  mutate(
    dob = as.Date(dob),  # Convert dob to Date format
    age = as.integer(time_length(interval(dob, Sys.Date()), "years"))  # Calculate age in years
  )

# Prepare data for modeling
df_model <- df %>%
  select(is_fraud, transaction_amount.USD., distance_km, city_pop, age) %>%
  na.omit()

# Fit logistic regression with interaction terms
model <- glm(is_fraud ~ transaction_amount.USD. * distance_km + city_pop + age,
             data = df_model, family = binomial)

summary(model)
## 
## Call:
## glm(formula = is_fraud ~ transaction_amount.USD. * distance_km + 
##     city_pop + age, family = binomial, data = df_model)
## 
## Coefficients:
##                                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                         -5.876e+00  3.966e-02 -148.149   <2e-16 ***
## transaction_amount.USD.              3.382e-03  4.450e-05   75.989   <2e-16 ***
## distance_km                          3.351e-05  4.379e-05    0.765   0.4441    
## city_pop                             6.682e-08  3.819e-08    1.749   0.0802 .  
## age                                  9.945e-03  6.636e-04   14.986   <2e-16 ***
## transaction_amount.USD.:distance_km  2.457e-07  3.912e-07    0.628   0.5299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87262  on 965969  degrees of freedom
## Residual deviance: 75481  on 965964  degrees of freedom
## AIC: 75493
## 
## Number of Fisher Scoring iterations: 10

Multivariate Analysis

The correlation matrix indicates that transaction amount has a moderate positive correlation (0.25) with fraud, suggesting that higher-value transactions are more likely to be fraudulent. Other features such as city population, age, and distance show negligible correlation with fraud, and no strong correlations exist among the features, indicating low multicollinearity which is a favorable condition for modeling.

Additionally, logistic regression confirms that transaction amount and age are significant predictors of fraud (p < 0.001), while distance and its interaction with transaction amount are not statistically significant. City population has a weak, marginal effect (p ≈ 0.08). The model shows a substantial deviance reduction (from 87,262 to 75,481), indicating a reasonable fit.

In summary, transaction amount and age are key predictors of fraud, while other features have limited impact. The logistic regression model fits the data reasonably well with low multicollinearity among predictors.
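
As a rough way to quantify the fit noted above, McFadden's pseudo R-squared can be computed from the null and residual deviances reported in the model summary; the short sketch below (not part of the original output) gives a value of roughly 0.13.

# McFadden's pseudo R-squared from the fitted logistic regression
pseudo_r2 <- 1 - model$deviance / model$null.deviance
pseudo_r2   # approximately 1 - 75481/87262 = 0.135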

Modelling

Use Case 1 - Classify whether a credit card transaction is fraudulent.
Use Case 2 - Predict the transaction amount based on customer and transaction features.

Data Preparation

# Load necessary libraries
library(dplyr)
library(caret)
library(xgboost)
library(pROC)
library(glmnet)


# Loading data
df_train_cleaned <- df_train
df_test_cleaned <- test_main

# Merging Dataset
df_merged <- rbind(df_test_cleaned,df_train_cleaned)
df <- df_merged %>% select("gender", "transaction_amount(USD)", "category", "city_pop", "merch_lat", "merch_long", "is_fraud")

Encoding Data

Encoding is performed to convert categorical variables such as gender and category into integer codes for use in the machine learning models.

preprocess <- function(df, label_encoders = NULL) {
  df <- df
  
  # Identify categorical columns (character or factor)
  cat_cols <- names(df)[sapply(df, function(x) is.character(x) || is.factor(x))]
  
  if (is.null(label_encoders)) {
    label_encoders <- list()
  }
  
  for (col in cat_cols) {
    # If not in label_encoders, fit new encoder (convert to factor, keep levels)
    if (is.null(label_encoders[[col]])) {
      df[[col]] <- as.factor(df[[col]])
      label_encoders[[col]] <- levels(df[[col]])
      # Store levels as encoder
    } else {
      # Use the existing levels for encoding, set as factor with specified levels
      df[[col]] <- factor(df[[col]], levels = label_encoders[[col]])
    }
    # Replace column with integer codes (1 based in R)
    df[[col]] <- as.integer(df[[col]])
  }
  
  return(list(df = df, label_encoders = label_encoders))
}


result <- preprocess(df)
encoded_df <- result$df
encoders <- result$label_encoders

Undersampling Data

To prevent model bias, undersampling is performed to balance the dataset, as non-fraud cases significantly outnumber fraud cases.

# Undersampling

set.seed(42)
fraud <- encoded_df %>% filter(is_fraud == 1)
nonfraud <- encoded_df %>% filter(is_fraud == 0) %>% sample_n(nrow(fraud))
df_bal <- bind_rows(fraud, nonfraud)

# Split data into 70% training and 30% testing
train_index <- createDataPartition(df_bal$is_fraud, p = 0.7, list = FALSE)

df_train_bal <- df_bal[train_index,]
df_test_cls <- df_bal[-train_index,]

x_train <- df_train_bal %>% select(-is_fraud)
y_train <- df_train_bal$is_fraud

x_test <- df_test_cls %>% select(-is_fraud)
y_test <- df_test_cls$is_fraud


Use Case 1: Classification (Detecting Fraud)

Two algorithms were used for classification modeling:

  1. Logistic Regression – an interpretable model that provides clear insights into the relationship between features and the target variable.
  2. XGBoost classification – a black-box model known for its high predictive performance but limited interpretability.

Logistic Regression Model

# Dataset for Classification: attach the class label as a factor
x_train$label <- factor(ifelse(y_train == 1, "yes", "no"))

# 5-fold cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  verboseIter = TRUE )

# Hyperparameter grid
grid <- expand.grid(
  alpha = c(0, 0.5, 1),                 
  lambda = seq(0.0001, 0.1, length = 10)   
)

# Training logistic regression with glmnet
log_model <- train(
  label ~ .,
  data = x_train,
  method = "glmnet",
  family = "binomial",
  trControl = ctrl,
  tuneGrid = grid,
  metric = "ROC"
)
## + Fold1: alpha=0.0, lambda=0.1 
## - Fold1: alpha=0.0, lambda=0.1 
## + Fold1: alpha=0.5, lambda=0.1 
## - Fold1: alpha=0.5, lambda=0.1 
## + Fold1: alpha=1.0, lambda=0.1 
## - Fold1: alpha=1.0, lambda=0.1 
## + Fold2: alpha=0.0, lambda=0.1 
## - Fold2: alpha=0.0, lambda=0.1 
## + Fold2: alpha=0.5, lambda=0.1 
## - Fold2: alpha=0.5, lambda=0.1 
## + Fold2: alpha=1.0, lambda=0.1 
## - Fold2: alpha=1.0, lambda=0.1 
## + Fold3: alpha=0.0, lambda=0.1 
## - Fold3: alpha=0.0, lambda=0.1 
## + Fold3: alpha=0.5, lambda=0.1 
## - Fold3: alpha=0.5, lambda=0.1 
## + Fold3: alpha=1.0, lambda=0.1 
## - Fold3: alpha=1.0, lambda=0.1 
## + Fold4: alpha=0.0, lambda=0.1 
## - Fold4: alpha=0.0, lambda=0.1 
## + Fold4: alpha=0.5, lambda=0.1 
## - Fold4: alpha=0.5, lambda=0.1 
## + Fold4: alpha=1.0, lambda=0.1 
## - Fold4: alpha=1.0, lambda=0.1 
## + Fold5: alpha=0.0, lambda=0.1 
## - Fold5: alpha=0.0, lambda=0.1 
## + Fold5: alpha=0.5, lambda=0.1 
## - Fold5: alpha=0.5, lambda=0.1 
## + Fold5: alpha=1.0, lambda=0.1 
## - Fold5: alpha=1.0, lambda=0.1 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0, lambda = 0.0223 on full training set
# Testing Data
log_probs <- predict(log_model, newdata = x_test, type = "prob")[, "yes"]

# Convert probabilities to binary labels
log_preds <- ifelse(log_probs > 0.5, 1, 0)

print(c(log_model$best))
## $alpha
## [1] 0
## 
## $lambda
## [1] 0.0223
# Confusion matrix
conf_mat_log <- confusionMatrix(factor(log_preds), factor(y_test),positive = "1")
print(conf_mat_log)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 621 156
##          1  22 487
##                                         
##                Accuracy : 0.8616        
##                  95% CI : (0.8415, 0.88)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7232        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.7574        
##             Specificity : 0.9658        
##          Pos Pred Value : 0.9568        
##          Neg Pred Value : 0.7992        
##              Prevalence : 0.5000        
##          Detection Rate : 0.3787        
##    Detection Prevalence : 0.3958        
##       Balanced Accuracy : 0.8616        
##                                         
##        'Positive' Class : 1             
## 
# F1 Score
lr_precision <- conf_mat_log$byClass["Pos Pred Value"]
lr_recall <- conf_mat_log$byClass["Sensitivity"]
lr_f1_score <- 2 * ((lr_precision * lr_recall) / (lr_precision + lr_recall))
cat("F1 Score :", round(lr_f1_score, 4), "\n")
## F1 Score : 0.8455
# ROC and AUC
roc_curve_log <- roc(y_test, log_probs)
plot(roc_curve_log, col = "red", main = "ROC Curve - Logistic Regression")

print(roc_curve_log$auc)
## Area under the curve: 0.8545
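Because the earlier imbalance discussion proposed AUPRC as an alternative evaluation metric, a minimal sketch of how it could be computed for these logistic regression predictions is given below, assuming the PRROC package (not loaded above). Note that on this undersampled, balanced test set the precision-recall baseline is about 0.5 rather than the raw fraud rate.

# Precision-recall AUC for the logistic regression predictions (sketch, assumes PRROC)
library(PRROC)

pr <- pr.curve(scores.class0 = log_probs[y_test == 1],   # scores for the positive (fraud) class
               scores.class1 = log_probs[y_test == 0],   # scores for the negative class
               curve = TRUE)
pr$auc.integral
plot(pr)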

XGBoost Classification Model

# Hyperparameter tuning grid
xgb_grid <- expand.grid(
  nrounds = c(50, 100),
  max_depth = c(3, 6 ,10),
  eta = c(0.1, 0.3),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# Training model
xgb_model <- train(
  label ~ .,
  data = x_train,
  method = "xgbTree",
  trControl = ctrl,
  tuneGrid = xgb_grid,
  metric = "ROC"
)
## + Fold1: eta=0.1, max_depth= 3, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## - Fold1: eta=0.1, max_depth= 3, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## + Fold1: eta=0.1, max_depth= 6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## - Fold1: eta=0.1, max_depth= 6, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## + Fold1: eta=0.1, max_depth=10, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:49:34] WARNING: src/c_api/c_api.cc:935: `ntree_limit` is deprecated, use `iteration_range` instead.
## - Fold1: eta=0.1, max_depth=10, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=100 
## (fold-by-fold caret tuning log omitted: one "+"/"-" progress line per fold
## and parameter combination, interleaved with repeated xgboost warnings that
## `ntree_limit` is deprecated in favour of `iteration_range`)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 50, max_depth = 10, eta = 0.3, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1 on full training set
print(c(xgb_model$best))
## $nrounds
## [1] 50
## 
## $max_depth
## [1] 10
## 
## $eta
## [1] 0.3
## 
## $gamma
## [1] 0
## 
## $colsample_bytree
## [1] 1
## 
## $min_child_weight
## [1] 1
## 
## $subsample
## [1] 1
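The list above is only the winning combination; the full cross-validation grid can also be inspected to see how the other settings compared. A minimal sketch, assuming the xgb_model object from the chunk above and ordering by whichever metric caret tuned on:

# Show the top cross-validated parameter combinations from the tuning grid
res <- xgb_model$results
head(res[order(res[[xgb_model$metric]], decreasing = TRUE), ], 5)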
# Testing model
xgc_pred_probs <- predict(xgb_model, x_test, type = "prob")[, "yes"]
xgc_pred_labels <- ifelse(xgc_pred_probs > 0.5, 1, 0)

# Confusion Matrix
conf_mat <- confusionMatrix(factor(xgc_pred_labels), factor(y_test), positive = "1")
print(conf_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 618  23
##          1  25 620
##                                           
##                Accuracy : 0.9627          
##                  95% CI : (0.9508, 0.9724)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9253          
##                                           
##  Mcnemar's Test P-Value : 0.8852          
##                                           
##             Sensitivity : 0.9642          
##             Specificity : 0.9611          
##          Pos Pred Value : 0.9612          
##          Neg Pred Value : 0.9641          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4821          
##    Detection Prevalence : 0.5016          
##       Balanced Accuracy : 0.9627          
##                                           
##        'Positive' Class : 1               
## 
# F1 Score
xgc_precision <- conf_mat$byClass["Pos Pred Value"]
xgc_recall <- conf_mat$byClass["Sensitivity"]
xgc_f1_score <- 2 * ((xgc_precision * xgc_recall) / (xgc_precision + xgc_recall))
cat("F1 Score :", round(xgc_f1_score,4))
## F1 Score : 0.9627
# ROC Curve
roc_curve <- roc(y_test, xgc_pred_probs)
plot(roc_curve,  col= "blue", main= "ROC Curve - XGBoost")

print(roc_curve$auc)
## Area under the curve: 0.9918
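The 0.5 cut-off used above is simply the default. An operating threshold can instead be read off the ROC curve; a minimal sketch using pROC's coords() with Youden's J statistic, assuming roc_curve and xgc_pred_probs from the chunks above:

# Pick an operating threshold from the ROC curve instead of the fixed 0.5
library(pROC)
best_cut <- coords(roc_curve, "best",
                   ret = c("threshold", "sensitivity", "specificity"),
                   best.method = "youden")
print(best_cut)

# Re-label the test predictions with the data-driven threshold
xgc_pred_labels_alt <- ifelse(xgc_pred_probs > best_cut$threshold, 1, 0)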

Classification Evaluation Summary

name <- c("Logistic Regression","XGboost Classification")
pres <- c(lr_precision,xgc_precision)
recall <- c(lr_recall,xgc_recall)
f1 <- c(lr_f1_score,xgc_f1_score)
AUC_1 <- c(roc_curve_log$auc,roc_curve$auc)

cls_sum <- data.frame(
  Model = name,
  Precision = pres,
  Recall = recall,
  F1_Score = f1,
  AUC = AUC_1
)

cls_sum
##                    Model Precision    Recall  F1_Score       AUC
## 1    Logistic Regression 0.9567780 0.7573872 0.8454861 0.8545286
## 2 XGboost Classification 0.9612403 0.9642302 0.9627329 0.9918466

Use Case 2: Regression (Predict Transaction Amount)

Two algorithms were used for regression modeling:

  1. Linear Regression – an interpretable model that provides clear insights into the relationship between features and the target variable.
  2. XGBoost Regression – a black-box model known for its high predictive performance but limited interpretability.
# Selecting test and train data
x_train_reg <- df_train_bal %>% select(-"transaction_amount(USD)")
y_train_reg <- df_train_bal$"transaction_amount(USD)"

x_test_reg <- df_test_cls %>% select(-"transaction_amount(USD)")
y_test_reg <- df_test_cls$"transaction_amount(USD)"

# Fit scaler , apply to train and test
scaler <- preProcess(x_train_reg, method = "range") # default is [0, 1]
x_train_scaled <- predict(scaler, x_train_reg)
x_test_scaled  <- predict(scaler, x_test_reg)
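For reference, the "range" method rescales each column with the training-set minimum and maximum, and the same training-set statistics are then applied to the test set. A hand-rolled equivalent, assuming all predictors are already numeric after the earlier encoding (illustration only, not used in the pipeline):

# Manual min-max scaling: (x - min_train) / (max_train - min_train),
# always using the TRAINING minimum and maximum, even for the test set
min_max_scale <- function(x, x_ref) {
  (x - min(x_ref)) / (max(x_ref) - min(x_ref))
}
x_test_manual <- as.data.frame(Map(min_max_scale, x_test_reg, x_train_reg))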

Linear Regression

x_train_scaled$labels <- y_train_reg

ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE)

lin_model <- train(
  labels ~ .,
  data = x_train_scaled,
  method = "lm",
  trControl = ctrl,
  metric = "RMSE"
)
## + Fold1: intercept=TRUE 
## - Fold1: intercept=TRUE 
## + Fold2: intercept=TRUE 
## - Fold2: intercept=TRUE 
## + Fold3: intercept=TRUE 
## - Fold3: intercept=TRUE 
## + Fold4: intercept=TRUE 
## - Fold4: intercept=TRUE 
## + Fold5: intercept=TRUE 
## - Fold5: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set
# Predict on test set
lin_preds <- predict(lin_model, newdata = x_test_scaled)

# Evaluate
rmse_lin <- RMSE(y_test_reg, lin_preds)
r2_lin <- R2(lin_preds, y_test_reg)
mae_lin <- MAE(lin_preds, y_test_reg)

list(RMSE = rmse_lin, R2 = r2_lin , mae = mae_lin)
## $RMSE
## [1] 264.0571
## 
## $R2
## [1] 0.5154375
## 
## $mae
## [1] 209.8559
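Since interpretability is the main argument for keeping the linear model, its fitted coefficients can be examined directly. A small sketch, assuming lin_model from above (the coefficients are on the min-max scaled feature scale):

# Largest coefficients (in absolute value) of the underlying lm fit
lin_coefs <- coef(lin_model$finalModel)
head(sort(abs(lin_coefs), decreasing = TRUE), 10)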

XGBoost Regression

x_train_scaled$labels <- y_train_reg

ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE)

xgb_grid <- expand.grid(
  nrounds = c(50, 100),
  max_depth = c(3, 6 ,10),
  eta = c(0.1, 0.3),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# Train XGBoost regressor
xgb_reg <- train(
  labels ~ .,
  data = x_train_scaled,
  method = "xgbTree",
  trControl = ctrl,
  tuneGrid = xgb_grid,
  metric = "RMSE"
)
## (fold-by-fold caret tuning log omitted: one "+"/"-" progress line per fold
## and parameter combination, interleaved with repeated xgboost warnings that
## `ntree_limit` is deprecated in favour of `iteration_range`)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 50, max_depth = 6, eta = 0.1, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1 on full training set
print(c(xgb_reg$best))
## $nrounds
## [1] 50
## 
## $max_depth
## [1] 6
## 
## $eta
## [1] 0.1
## 
## $gamma
## [1] 0
## 
## $colsample_bytree
## [1] 1
## 
## $min_child_weight
## [1] 1
## 
## $subsample
## [1] 1
# Predict & evaluate
pred_probs <- predict(xgb_reg, x_test_scaled)

rmse_xg <- RMSE(pred_probs, y_test_reg)
r2_xg <- R2(pred_probs, y_test_reg)
mae_xg <- MAE(pred_probs, y_test_reg)

list(RMSE = rmse_xg, R2 = r2_xg, mae = mae_xg)
## $RMSE
## [1] 467.3774
## 
## $R2
## [1] 0.9092982
## 
## $mae
## [1] 55.14382
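To see which inputs drive the predicted amounts, caret's varImp() can be applied to the tuned regressor; a brief sketch, assuming xgb_reg from the chunk above:

# Gain-based feature importance from the tuned XGBoost regression model
xgb_reg_imp <- varImp(xgb_reg)
print(xgb_reg_imp)
plot(xgb_reg_imp, top = 10)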

Regression Evaluation Summary

name <- c("Linear Regression","XGboost Regression")
rmse_ <- c(rmse_lin,rmse_xg)
r2_ <- c(r2_lin,r2_xg)
mae_ <- c(mae_lin,mae_xg)

reg_sum <- data.frame(
  Model = name,
  RMSE = rmse_,
  MAE = mae_,
  R2 = r2_
)

reg_sum
##                Model     RMSE       MAE        R2
## 1  Linear Regression 264.0571 209.85593 0.5154375
## 2 XGboost Regression 467.3774  55.14382 0.9092982

Discussion of Outputs

Classification – Fraud Detection

Two models were developed: Logistic Regression and XGBoost Classifier.
The dataset was heavily imbalanced, so undersampling was applied to balance the classes.

Evaluation Metrics:

  • XGBoost achieved higher accuracy, sensitivity, and F1-score compared to Logistic Regression.
  • Logistic Regression had slightly lower metrics but was more interpretable.

These results suggest that XGBoost is better at capturing complex fraud patterns, but Logistic Regression can still be useful in scenarios where explainability is essential (e.g., compliance, audit trails).

Confusion Matrix Insights:

  • Logistic Regression missed noticeably more fraud cases (recall of about 0.76), whereas XGBoost kept its true positive and true negative rates nearly balanced (0.964 and 0.961) on the undersampled test set.
  • XGBoost improved the true positive rate (fraud detection) and reduced false negatives, which is critical in fraud prevention; a quick check of these rates follows below.
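As a quick check, the rates discussed above follow directly from the XGBoost confusion-matrix counts reported earlier:

# Recompute the headline rates from the raw XGBoost confusion-matrix counts
tp <- 620; fn <- 23   # fraud cases caught vs. missed
tn <- 618; fp <- 25   # legitimate cases kept vs. wrongly flagged
c(sensitivity = tp / (tp + fn),   # true positive rate: 0.9642
  specificity = tn / (tn + fp))   # true negative rate: 0.9611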


Regression – Predicting Transaction Amount

Both Linear Regression and XGBoost Regression models were trained to predict transaction amounts.

Performance Summary:

Model                 RMSE      MAE       R²
Linear Regression     264.06    209.86    0.515
XGBoost Regression    467.38     55.14    0.909
  • XGBoost Regression had a much higher R², meaning it explained over 90% of the variance in transaction amounts.
  • It also had a significantly lower MAE, indicating more accurate predictions on average; both metrics are recomputed by hand below.
  • Linear Regression, although more interpretable, underperformed in capturing the non-linear relationships.
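For clarity, both headline metrics can be recomputed by hand from the linear model's predictions (assumes lin_preds and y_test_reg from the evaluation chunk; note that caret's R2() reports the squared correlation between predictions and observations, so it can differ slightly from the variance-explained form used here):

# R²: share of the variance in transaction amount explained by the predictions
ss_res <- sum((y_test_reg - lin_preds)^2)
ss_tot <- sum((y_test_reg - mean(y_test_reg))^2)
r2_manual <- 1 - ss_res / ss_tot

# MAE: average absolute prediction error, in USD
mae_manual <- mean(abs(y_test_reg - lin_preds))

c(R2 = r2_manual, MAE = mae_manual)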

Interpretation:

  • XGBoost excelled in predictive performance for both use cases, but at the cost of reduced interpretability.
  • For business applications, the choice between models depends on whether the priority is accuracy or explainability.


Overall

The outputs highlight the strength of tree-based models like XGBoost in handling real-world financial data. With proper preprocessing and validation, these models offer high predictive power for both classification and regression tasks.

Conclusion

This project explored the use of machine learning models for both fraud classification and transaction amount prediction.

Classification: Logistic Regression and XGBoost were evaluated. XGBoost achieved higher accuracy and AUC, while Logistic Regression provided better interpretability.

Regression: XGBoost Regression outperformed Linear Regression in terms of R² and MAE, although its reported RMSE was higher; RMSE penalises large errors more heavily than MAE, so a small number of large mispredictions can inflate it.

Key Findings: Features such as transaction amount, age, and merchant category had strong predictive power. Fraud tends to occur more frequently in high-value transactions and online categories.

Challenges: Severe class imbalance was addressed using undersampling. Future work could include SMOTE or cost-sensitive learning.
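As an illustration of the SMOTE alternative mentioned above (not run in this analysis), a minimal sketch using the smotefamily package; df_train_raw and is_fraud are placeholder names for the original unbalanced training frame and its label column:

# Hedged sketch: oversample the minority (fraud) class with SMOTE instead of
# undersampling the majority class. Requires numeric predictors.
library(smotefamily)

sm <- SMOTE(X = df_train_raw %>% select(-is_fraud),
            target = df_train_raw$is_fraud,
            K = 5)                 # nearest neighbours used to synthesise cases
df_train_smote <- sm$data          # SMOTE stores the label in a column named "class"
table(df_train_smote$class)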

Future Improvements:

  • Use of SHAP/LIME for model explainability (a SHAP sketch follows below)
  • Ensemble stacking of classifiers
  • Real-time fraud scoring with deployment pipelines
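For the SHAP idea, xgboost can already return per-feature contributions; a rough sketch against the tuned classifier, assuming xgb_model and x_test from the classification section and that all features are numeric:

# SHAP-style contributions from the underlying booster: one value per feature
# per transaction, plus a BIAS column
library(xgboost)

booster <- xgb_model$finalModel
shap_vals <- predict(booster, newdata = as.matrix(x_test), predcontrib = TRUE)

# Mean absolute contribution per feature gives a simple global ranking
contrib <- colMeans(abs(shap_vals))
head(sort(contrib[names(contrib) != "BIAS"], decreasing = TRUE), 10)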

Overall, machine learning offers promising results in both detecting fraud and understanding transaction behaviors, supporting risk mitigation strategies in financial services.