E-commerce Fraud Prevention

1. Introduction

In today's fast-evolving e-commerce environment, safe and reliable transactions are more essential than ever. As online transaction volumes rise, so does the likelihood of fraudulent activity. Fraud causes monetary losses for both companies and customers and erodes consumer confidence. Detecting and stopping fraudulent transactions as they occur is crucial to uphold the integrity of e-commerce platforms and protect users' private information.

1.1 Objectives

To identify the key features and patterns that distinguish fraudulent from legitimate transactions.
To predict fraudulent transactions using machine learning algorithms.
To evaluate model performance using accuracy, precision and recall metrics.

1.2 Data Description

Data source: https://www.kaggle.com/datasets/aryanrastogi7767/ecommerce-fraud-data

The dataset consists of two CSV files. The customer contact details file contains 168 records and 10 attributes, and the customer transaction details file contains 623 records and 11 attributes. The transaction file gives a broader picture of customer purchasing behaviour, while the contact file offers insight into how customers make their purchases.

2. Data Pre-processing

#load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
                                                                                                                         
#load dataset
Customer_Tran<-read.csv("cust_transaction_details.csv", header = T, sep = ",")
Customer_Contact<-read.csv("Customer_DF.csv", header = T, sep = ",")

#to check the class values of the dataset
class(Customer_Contact)

## [1] "data.frame"

class(Customer_Tran)

## [1] "data.frame"

#to view every attribute with Data Type in df
glimpse(Customer_Contact)

## Rows: 168
## Columns: 10
## $ X                      <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ customerEmail          <chr> "josephhoward@yahoo.com", "evansjeffery@yahoo.c…
## $ customerPhone          <chr> "400-108-5415", "1-788-091-7546", "024.420.0375…
## $ customerDevice         <chr> "yyeiaxpltf82440jnb3v", "r0jpm7xaeqqa3kr6mzum",…
## $ customerIPAddress      <chr> "8.129.104.40", "219.173.211.202", "67b7:3db8:6…
## $ customerBillingAddress <chr> "5493 Jones Islands\nBrownside, CA 51896", "356…
## $ No_Transactions        <int> 2, 3, 5, 3, 7, 1, 2, 6, 5, 0, 6, 7, 4, 4, 5, 4,…
## $ No_Orders              <int> 2, 3, 3, 3, 7, 1, 1, 5, 2, 0, 5, 5, 4, 2, 5, 4,…
## $ No_Payments            <int> 1, 7, 2, 1, 6, 2, 2, 2, 1, 1, 1, 4, 3, 1, 2, 1,…
## $ Fraud                  <chr> "False", "True", "False", "False", "True", "Tru…

glimpse(Customer_Tran)

## Rows: 623
## Columns: 11
## $ X                                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,…
## $ customerEmail                    <chr> "josephhoward@yahoo.com", "josephhowa…
## $ transactionId                    <chr> "a9lcj51r", "y4wcv03i", "5mi94sfw", "…
## $ orderId                          <chr> "vjbdvd", "yp6x27", "nlghpa", "uw0eeb…
## $ paymentMethodId                  <chr> "wt07xm68b", "wt07xm68b", "41ug157xz"…
## $ paymentMethodRegistrationFailure <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ paymentMethodType                <chr> "card", "card", "bitcoin", "bitcoin",…
## $ paymentMethodProvider            <chr> "JCB 16 digit", "JCB 16 digit", "Amer…
## $ transactionAmount                <int> 18, 26, 45, 23, 43, 33, 24, 24, 25, 2…
## $ transactionFailed                <int> 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
## $ orderState                       <chr> "pending", "fulfilled", "fulfilled", …

#Check for missing values in each column
colSums(is.na(Customer_Contact))

##                      X          customerEmail          customerPhone 
##                      0                      0                      0 
##         customerDevice      customerIPAddress customerBillingAddress 
##                      0                      0                      0 
##        No_Transactions              No_Orders            No_Payments 
##                      0                      0                      0 
##                  Fraud 
##                      0

colSums(is.na(Customer_Tran))

##                                X                    customerEmail 
##                                0                                0 
##                    transactionId                          orderId 
##                                0                                0 
##                  paymentMethodId paymentMethodRegistrationFailure 
##                                0                                0 
##                paymentMethodType            paymentMethodProvider 
##                                0                                0 
##                transactionAmount                transactionFailed 
##                                0                                0 
##                       orderState 
##                                0

#drop unnecessary column
Customer_Contact <- Customer_Contact %>% select(-X, -customerPhone, -customerDevice)
Customer_Tran <- Customer_Tran %>% select(-X, -transactionId, -orderId, -paymentMethodId)


#convert the Fraud labels (stored as the strings "True"/"False") to binary 1/0
Customer_Contact$Fraud <- ifelse(Customer_Contact$Fraud == "True", 1, 0)
glimpse(Customer_Contact)

## Rows: 168
## Columns: 7
## $ customerEmail          <chr> "josephhoward@yahoo.com", "evansjeffery@yahoo.c…
## $ customerIPAddress      <chr> "8.129.104.40", "219.173.211.202", "67b7:3db8:6…
## $ customerBillingAddress <chr> "5493 Jones Islands\nBrownside, CA 51896", "356…
## $ No_Transactions        <int> 2, 3, 5, 3, 7, 1, 2, 6, 5, 0, 6, 7, 4, 4, 5, 4,…
## $ No_Orders              <int> 2, 3, 3, 3, 7, 1, 1, 5, 2, 0, 5, 5, 4, 2, 5, 4,…
## $ No_Payments            <int> 1, 7, 2, 1, 6, 2, 2, 2, 1, 1, 1, 4, 3, 1, 2, 1,…
## $ Fraud                  <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,…

#count unique emails in Customer_Contact
unique_emails_contact <- length(unique(Customer_Contact$customerEmail))
cat("Unique emails in Customer_Contact:", unique_emails_contact, "\n")

## Unique emails in Customer_Contact: 161

#count unique emails in Customer_Tran
unique_emails_tran <- length(unique(Customer_Tran$customerEmail))
cat("Unique emails in Customer_Tran:", unique_emails_tran, "\n")

## Unique emails in Customer_Tran: 136

unique(Customer_Tran$paymentMethodType)

## [1] "card"      "bitcoin"   "apple pay" "paypal"

unique(Customer_Tran$paymentMethodProvider)

##  [1] "JCB 16 digit"                "American Express"           
##  [3] "VISA 16 digit"               "Discover"                   
##  [5] "Voyager"                     "VISA 13 digit"              
##  [7] "Maestro"                     "Mastercard"                 
##  [9] "Diners Club / Carte Blanche" "JCB 15 digit"

#to view number of times each email repeated in customer_contact

result <- list()
for (i in unique(Customer_Contact$customerEmail)) {
  repeat_count <- sum(Customer_Contact$customerEmail == i)
  if (repeat_count > 1) {
    result[[i]] <- repeat_count
  }
}
print(result)

## $`johnlowery@gmail.com`
## [1] 8
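The same duplicate check can be sketched without an explicit loop. The snippet below is a minimal illustration on a made-up vector (`emails` is hypothetical toy data, not the project file); it uses base R `table()` to count occurrences and keeps only the addresses seen more than once.

```r
# Toy data standing in for Customer_Contact$customerEmail (hypothetical values)
emails <- c("a@example.com", "b@example.com", "a@example.com",
            "c@example.com", "a@example.com")

# Count how often each address occurs, then keep those seen more than once
email_counts <- table(emails)
repeated_emails <- email_counts[email_counts > 1]
print(repeated_emails)
```

On the real data the same two lines would be applied to `Customer_Contact$customerEmail`.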

#filter records for the email "johnlowery@gmail.com"
repeated <- Customer_Contact[Customer_Contact$customerEmail == "johnlowery@gmail.com", ]
print(repeated)

##            customerEmail                       customerIPAddress
## 8   johnlowery@gmail.com                          212.144.68.190
## 41  johnlowery@gmail.com 6c21:ac1d:2089:68fa:abb7:8c00:525f:6588
## 46  johnlowery@gmail.com                          222.79.159.140
## 66  johnlowery@gmail.com  42b3:df19:86fe:abd9:dafe:f6c1:eb76:c72
## 80  johnlowery@gmail.com                          163.128.139.42
## 134 johnlowery@gmail.com  e4c:fb48:8ee2:9819:6ae8:8d3f:3b6a:a788
## 156 johnlowery@gmail.com f259:657f:f329:2fca:c06c:8b57:d6ac:2380
## 166 johnlowery@gmail.com  f82c:811f:8a02:e2d6:79b:fcaa:42de:570b
##                                       customerBillingAddress No_Transactions
## 8           484 Pamela Pass\nLake Jessicaview, WI 12942-9074               6
## 41  08238 Kyle Squares Suite 893\nMillermouth, IN 27535-5397               0
## 46                   77711 Pamela Ridge\nNew Kayla, IL 27182               3
## 66  11704 Andrew Villages Apt. 035\nJamesfurt, OR 49817-0470               7
## 80                   814 Wagner Union\nAshleymouth, HI 35617               2
## 134        518 Wood Mews Apt. 970\nDillonstad, NE 43317-3945               0
## 156       687 Rogers Bridge Suite 780\nValdezburgh, IN 23532               6
## 166                 548 Bryant Inlet\nVeronicaside, OK 00522               4
##     No_Orders No_Payments Fraud
## 8           5           2     1
## 41          0           1     1
## 46          2           1     1
## 66          5           1     1
## 80          2           1     1
## 134         0           0     1
## 156         5           3     1
## 166         4           2     1

#count the Customer_Contact rows whose email also appears in Customer_Tran
common <- sum(Customer_Contact$customerEmail %in% Customer_Tran$customerEmail)

common #143 matching rows (a repeated email is counted once per row)

## [1] 143
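Because one address appears several times in the contact file, the number of matching rows and the number of matching distinct addresses can differ. The sketch below uses hypothetical toy vectors (`contact` and `tran` are illustrative names only) to show the two counts side by side.

```r
# Toy stand-ins for the two email columns (hypothetical values)
contact <- c("a@example.com", "a@example.com", "b@example.com", "c@example.com")
tran <- c("a@example.com", "b@example.com", "b@example.com", "d@example.com")

# Every contact row with a match is counted, so the duplicate counts twice
matching_rows <- sum(contact %in% tran)

# Each shared address is counted once
matching_unique <- length(intersect(contact, tran))

print(c(rows = matching_rows, unique = matching_unique))
```

In the report's data, 143 matching rows against 136 distinct transaction emails is consistent with the one address that occupies eight contact rows.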

#keep the Customer_Contact rows whose email also appears in Customer_Tran (the Customer_Contact attributes are retained)
dataset <- Customer_Contact[Customer_Contact$customerEmail %in% Customer_Tran$customerEmail, ]
# To see the shape of the data frame (number of rows and columns)
dim(dataset)

## [1] 143   7

glimpse(dataset)

## Rows: 143
## Columns: 7
## $ customerEmail          <chr> "josephhoward@yahoo.com", "evansjeffery@yahoo.c…
## $ customerIPAddress      <chr> "8.129.104.40", "219.173.211.202", "67b7:3db8:6…
## $ customerBillingAddress <chr> "5493 Jones Islands\nBrownside, CA 51896", "356…
## $ No_Transactions        <int> 2, 3, 5, 3, 7, 1, 2, 6, 5, 6, 7, 4, 4, 5, 4, 6,…
## $ No_Orders              <int> 2, 3, 3, 3, 7, 1, 1, 5, 2, 5, 5, 4, 2, 5, 4, 3,…
## $ No_Payments            <int> 1, 7, 2, 1, 6, 2, 2, 2, 1, 1, 4, 3, 1, 2, 1, 2,…
## $ Fraud                  <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0,…

3. EDA

3.1 Fraud vs Non-Fraud

fraud_summary <- Customer_Contact %>%
  group_by(Fraud) %>%
  summarise(count = n())  # Calculate the count of fraud vs non-fraud

# Plot the Fraud vs Non-Fraud bar chart with counts
ggplot(fraud_summary, aes(x=factor(Fraud), y=count, fill=factor(Fraud))) +
  geom_bar(stat="identity", show.legend=FALSE) +
  geom_text(aes(label=sprintf("%d", count)), vjust=-0.5) +  # Display the count instead of percentage
  labs(title="Fraud vs Non-Fraud", x="Fraud", y="Count") +
  scale_fill_manual(values = c("0" = "#66c2a5", "1" = "#006400")) +  # Apply custom colors for False (0) and True (1)
  scale_x_discrete(labels = c("0" = "False", "1" = "True")) +  # Change axis labels to False and True
  theme_minimal()

There are 61 fraudulent cases, so the data is slightly, but not severely, imbalanced.
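As a quick check of how imbalanced the classes are, the class shares can be worked out from the two figures already reported (168 contact records, 61 of them fraudulent). The variable names below are illustrative.

```r
n_total <- 168  # rows in the contact file, from the glimpse() output
n_fraud <- 61   # fraud count quoted in the text

fraud_share <- n_fraud / n_total
non_fraud_share <- 1 - fraud_share

# Shares in percent, rounded to one decimal place
print(round(100 * c(fraud = fraud_share, non_fraud = non_fraud_share), 1))
```

Roughly 36% of the records are fraudulent, which is why the imbalance is described as mild.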

3.2 Payment Method Type Usage Percentage

payment_method_usage <- Customer_Tran %>%
  group_by(paymentMethodType) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

# Plot the Payment Method Type Usage Percentage bar chart
ggplot(payment_method_usage, aes(x=paymentMethodType, y=percentage, fill=paymentMethodType)) +
  geom_bar(stat='identity') +
  geom_text(aes(label=sprintf("%.1f%%", percentage)), vjust=-0.5) +  # Add percentage labels
  labs(title="Payment Method Type Usage Percentage", x="Payment Method Type", y="Percentage") +
  scale_fill_brewer(palette="Dark2") +  # Apply the ColorBrewer "Dark2" palette
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

The data shows that the most commonly used payment method is card payments, accounting for 76.9%. The usage percentages of the other three payment methods—Apple Pay, Bitcoin, and PayPal—are relatively similar.

3.3 Payment Method Provider vs Method Type

payment_provider_usage <- Customer_Tran %>%
  group_by(paymentMethodType, paymentMethodProvider) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

## `summarise()` has grouped output by 'paymentMethodType'. You can override using
## the `.groups` argument.

# Plot Payment Method Provider vs Method Type bar chart (separate bars)
ggplot(payment_provider_usage, aes(x = paymentMethodType, y = percentage, fill = paymentMethodProvider)) +
  geom_bar(stat = 'identity', position = 'dodge') +  # Separate bars for each provider
  geom_text(aes(label = sprintf("%.1f", percentage)), position = position_dodge(width = 0.9), vjust = -0.5) +  # Add percentage labels
  labs(title = "Payment Method Provider vs Method Type", x = "Payment Method Type", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  # Apply a color palette with enough colors for all 10 providers
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The mix of providers differs by payment method type. For example, JCB 15 digit appears mainly with PayPal and Voyager mainly with Bitcoin, while most providers are spread fairly evenly across Apple Pay and card payments.
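The percentages in this chart are computed within each payment method type, because `summarise()` leaves the result grouped by `paymentMethodType` and the following `mutate()` therefore divides by each type's own total. A minimal base R sketch on hypothetical toy data (the `tran` data frame and its values are made up) contrasts within-type percentages with percentages of the whole table.

```r
# Toy transactions (hypothetical values)
tran <- data.frame(
  type = c("card", "card", "card", "paypal", "paypal"),
  provider = c("Visa", "Visa", "Maestro", "Visa", "Discover")
)

counts <- table(tran$type, tran$provider)

# Share of each provider within its payment type: every row sums to 100
pct_within_type <- 100 * prop.table(counts, margin = 1)

# Share of each type/provider pair in all transactions: whole table sums to 100
pct_overall <- 100 * prop.table(counts)

print(round(pct_within_type, 1))
print(round(pct_overall, 1))
```

Which denominator is appropriate depends on the question being asked, so the axis label should say which one is plotted.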

3.4 Count of Order State vs Payment Method Type

order_state_payment_method <- Customer_Tran %>%
  group_by(orderState, paymentMethodType) %>%
  summarise(count = n())

## `summarise()` has grouped output by 'orderState'. You can override using the
## `.groups` argument.

# Plot Count of Order State vs Payment Method Type
ggplot(order_state_payment_method, aes(x=paymentMethodType, y=count, fill=orderState)) +
  geom_bar(stat='identity', position="dodge") +  # Dodge bar chart
  geom_text(aes(label=count), position=position_dodge(0.8), vjust=-0.2) +  # Add count labels
  labs(title="Count of Order State vs Payment Method Type", x="Payment Method Type", y="Count") +
  scale_fill_brewer(palette="Set2") +  # Apply color palette
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels

Fulfilled orders are predominantly card payments (397). Failed and pending orders are fewer but are also mostly card-based, which reflects how widely card payments are used.

3.5 Count of Payment Method Provider vs Method Registration Failure

provider_failure_summary <- Customer_Tran %>%
  group_by(paymentMethodProvider, paymentMethodRegistrationFailure) %>%
  summarise(count = n(), .groups = "drop")

# Plot grouped bar chart with green-based colors and numeric labels
ggplot(provider_failure_summary, aes(x = paymentMethodProvider, y = count, fill = factor(paymentMethodRegistrationFailure))) +
  geom_bar(stat = "identity", position = "dodge") +  # Grouped bar chart
  geom_text(aes(label = count), position = position_dodge(0.9), vjust = -0.5, size = 4) +  # Add numeric labels
  scale_fill_manual(values = c("#66c2a5", "#006400"),  # Light green for success, dark green for failure
                    labels = c("Success", "Failure")) +  # Add legend labels
  labs(
    title = "Registration Success vs Failure by Provider",
    x = "Payment Method Provider",
    y = "Count",
    fill = "Status"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    legend.position = "top"  # Position legend at the top
  )

JCB 16 digit and VISA 16 digit have the most successful registrations, and JCB 15 digit, Maestro, and Voyager show no registration failures. Overall, the providers with the lowest failure rates include JCB 16 digit, VISA 16 digit, and Maestro.

3.6 Payment Method Provider vs Transaction Failed

# Summarize counts of transaction success and failure for each payment provider
transaction_failure_summary <- Customer_Tran %>%
  group_by(paymentMethodProvider, transactionFailed) %>%
  summarise(count = n()) %>%
  mutate(
    Status = ifelse(transactionFailed == 1, "Failed", "Success")
  )

## `summarise()` has grouped output by 'paymentMethodProvider'. You can override
## using the `.groups` argument.

# Plot grouped bar chart for transaction failures vs providers
ggplot(transaction_failure_summary, aes(x = paymentMethodProvider, y = count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +  # Grouped bar chart
  geom_text(aes(label = count), position = position_dodge(0.9), vjust = -0.5, size = 3) +  # Add count labels
  labs(
    title = "Transaction Success vs Failure by Payment Provider",
    x = "Payment Method Provider",
    y = "Count"
  ) +
  scale_fill_manual(values = c("Success" = "#66c2a5", "Failed" = "#006400")) +  # Light green for success, dark green for failure
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

JCB 16 digit and VISA 16 digit record the most successful transactions, while some providers, such as Diners Club, show a higher share of failures. Because percentages alone may not reflect the true proportions, trends are best compared using the counts in the bar charts.
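Counts and rates answer different questions, and the gap between them is largest for providers with few transactions. The sketch below, on hypothetical toy data (provider names `A` and `B` and all values are made up), computes both per provider with base R `tapply()`.

```r
# Toy transactions: 1 = failed, 0 = succeeded (hypothetical values)
tran <- data.frame(
  provider = c("A", "A", "A", "A", "B", "B"),
  failed = c(1, 0, 0, 0, 1, 0)
)

fail_count <- tapply(tran$failed, tran$provider, sum)   # number of failures
fail_rate <- tapply(tran$failed, tran$provider, mean)   # share of failures
n_tran <- tapply(tran$failed, tran$provider, length)    # transactions per provider

print(data.frame(n_tran, fail_count, fail_rate))
```

Both toy providers have one failure, yet the failure rate of `B` is twice that of `A`, so reporting the transaction count next to any rate helps the reader judge how reliable it is.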

3.7 Count of No Payments vs Fraud

# Group by No_Payments and Fraud, then count the records
no_payment_count <- Customer_Contact %>%
  group_by(No_Payments, Fraud) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(Fraud = ifelse(Fraud == 1, "True", "False"))

# Ensure No_Payments has all values from 1 to 15, even if some values are missing
no_payment_count <- no_payment_count %>%
  complete(No_Payments = 1:15, Fraud, fill = list(count = 0))

# Create the plot
ggplot(no_payment_count, aes(x = factor(No_Payments), y = count, fill = Fraud)) +
  geom_bar(stat = "identity", position = "dodge") +  # Grouped bar chart
  geom_text(aes(label = count), position = position_dodge(0.9), vjust = -0.5, size = 3) +  # Add count labels
  labs(
    title = "Count of No Payments vs Fraud",
    x = "No Payments",
    y = "Count"
  ) +
  scale_fill_manual(values = c("False" = "#66c2a5", "True" = "#006400")) +  # Green for False, dark green for True
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

Fraud cases are concentrated among customers with few payments (0-3), peaking at 1 or 2 payments, whereas customers with many payments (5-15) rarely appear among the fraud cases. This suggests that fraudsters tend to target users with regular consumer behavior.

4. Data Transformation

This step uses a one-hot-style encoding, which converts categorical data into a numerical format that models can process. Each category becomes its own column and is treated as an independent, unrelated entity. Because the data is aggregated per customer, each column holds the number of that customer's transactions in the category rather than a single 0/1 flag.
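The encoding described above can be sketched in two steps: build one indicator column per category, then sum the indicators per customer. The example uses base R `model.matrix()` and `rowsum()` on hypothetical toy data (the `tran` data frame and its values are made up); it illustrates the idea and is not the code used in the sections that follow.

```r
# Toy transactions (hypothetical values)
tran <- data.frame(
  email = c("a@example.com", "a@example.com", "b@example.com"),
  type = factor(c("card", "paypal", "card"))
)

# One 0/1 indicator column per category; "- 1" drops the intercept so that
# no category is left out as a reference level
one_hot <- model.matrix(~ type - 1, data = tran)

# Sum the indicators per customer to get per-customer category counts
counts_per_customer <- rowsum(one_hot, group = tran$email)
print(counts_per_customer)
```

The resulting matrix has one row per customer and one column per category, which is the shape the per-customer feature table needs.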

4.1 Merging total_transaction_amt column

Total_transaction_amt <- numeric(143)
#loop through each customerEmail in dataset
for (i in 1:143) {
  s <- 0
  for (j in 1:623) {
    if (dataset$customerEmail[i] == Customer_Tran$customerEmail[j]) {
      s <- s + Customer_Tran$transactionAmount[j]
    }
  }
  #store the total transaction amount for each customer in the dataset
  Total_transaction_amt[i] <- s
}

#add the calculated Total_transaction_amt as a new column in the dataset
dataset$Total_transaction_amt <- Total_transaction_amt
view(dataset)
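The nested loop above hard-codes the row counts (143 and 623). A vectorised alternative, sketched here on hypothetical toy data (`customers`, `tran` and their values are made up), sums the amounts per email with `tapply()` and then looks up each customer's total with `match()`. Customers without transactions get 0, as in the loop.

```r
# Toy stand-ins for dataset and Customer_Tran (hypothetical values)
customers <- data.frame(email = c("a@example.com", "b@example.com", "c@example.com"))
tran <- data.frame(
  email = c("a@example.com", "b@example.com", "a@example.com"),
  amount = c(10, 5, 7)
)

# Total amount per email that occurs in the transactions
totals <- tapply(tran$amount, tran$email, sum)

# Look up each customer's total; unmatched customers yield NA
idx <- match(customers$email, names(totals))
customers$total_amt <- as.numeric(totals)[idx]
customers$total_amt[is.na(customers$total_amt)] <- 0

print(customers)
```

Because the lookup is by email, the result does not depend on the number of rows in either table.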

4.2 Merging transactionfailed column

No_transactionfail <- numeric(143)

#loop through each customerEmail in dataset
for (i in 1:143) {
  s <- 0
  for (j in 1:623) {
    if (dataset$customerEmail[i] == Customer_Tran$customerEmail[j]) {
      s <- s + Customer_Tran$transactionFailed[j]
    }
  }
  #store the total transactionfailed for each customer in the dataset
  No_transactionfail[i] <- s
}

#add the calculated transactionfailed as a new column in the dataset
dataset$No_transactionfail <- No_transactionfail
view(dataset)

4.3 Merging paymentMethodRegistrationFailure column

Payment_Method_Reg_Fail <- numeric(143)

# Loop through each customerEmail in dataset
for (i in 1:143) {
  s <- 0
  for (j in 1:623) {
    if (dataset$customerEmail[i] == Customer_Tran$customerEmail[j]) {
      s <- s + Customer_Tran$paymentMethodRegistrationFailure[j]
    }
  }
  # Store the total  paymentMethodRegistrationFailure for each customer in the dataset
  Payment_Method_Reg_Fail[i] <- s
}

#add the calculated  paymentMethodRegistrationFailure as a new column in the dataset
dataset$Payment_Method_Reg_Fail <- Payment_Method_Reg_Fail
view(dataset)

4.4 Merging each payment type column

#function to count the number of payments of a specific category for each email
col_make <- function(column_name, category) {
  array <- numeric(143)  
  
  for (i in 1:143) {
    s <- 0
    for (j in 1:623) {
      if (dataset$customerEmail[i] == Customer_Tran$customerEmail[j]) {
        if (Customer_Tran[[column_name]][j] == category) {
          s <- s + 1
        }
      }
    }
    array[i] <- s
  }
  
  return(array)
}

#call the function for each payment type
PaypalPayments <- col_make('paymentMethodType', 'paypal')
ApplePayments <- col_make('paymentMethodType', 'apple pay')
BitcoinPayments <- col_make('paymentMethodType', 'bitcoin')
CardPayments <- col_make('paymentMethodType', 'card')

#add the new columns to the final data frame
dataset$PaypalPayments <- PaypalPayments
dataset$ApplePayments <- ApplePayments
dataset$CardPayments <- CardPayments
dataset$BitcoinPayments <- BitcoinPayments
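A shorter way to obtain the same per-customer category counts is sketched below. The helper `count_category()` is a hypothetical name introduced for illustration; it takes the vectors as arguments instead of reading `dataset` and `Customer_Tran` from the global environment, and it does not assume fixed row counts.

```r
# Count, for each customer email, the transactions whose value equals `category`
count_category <- function(customer_emails, tran_emails, tran_values, category) {
  vapply(
    customer_emails,
    function(e) sum(tran_emails == e & tran_values == category),
    numeric(1),
    USE.NAMES = FALSE
  )
}

# Toy data (hypothetical values); the first address is repeated on purpose
customer_emails <- c("a@example.com", "b@example.com", "a@example.com")
tran_emails <- c("a@example.com", "a@example.com", "b@example.com", "a@example.com")
tran_values <- c("card", "paypal", "card", "card")

card_counts <- count_category(customer_emails, tran_emails, tran_values, "card")
print(card_counts)
```

As with the loop-based version, a repeated customer email receives the same count on each of its rows.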

4.5 Count the number of payments of a specific category for each email

glimpse(dataset)

## Rows: 143
## Columns: 14
## $ customerEmail           <chr> "josephhoward@yahoo.com", "evansjeffery@yahoo.…
## $ customerIPAddress       <chr> "8.129.104.40", "219.173.211.202", "67b7:3db8:…
## $ customerBillingAddress  <chr> "5493 Jones Islands\nBrownside, CA 51896", "35…
## $ No_Transactions         <int> 2, 3, 5, 3, 7, 1, 2, 6, 5, 6, 7, 4, 4, 5, 4, 6…
## $ No_Orders               <int> 2, 3, 3, 3, 7, 1, 1, 5, 2, 5, 5, 4, 2, 5, 4, 3…
## $ No_Payments             <int> 1, 7, 2, 1, 6, 2, 2, 2, 1, 1, 4, 3, 1, 2, 1, 2…
## $ Fraud                   <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0…
## $ Total_transaction_amt   <dbl> 44, 111, 131, 85, 411, 25, 80, 1033, 109, 167,…
## $ No_transactionfail      <dbl> 0, 1, 2, 0, 2, 0, 1, 6, 3, 2, 3, 0, 2, 0, 0, 3…
## $ Payment_Method_Reg_Fail <dbl> 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PaypalPayments          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ApplePayments           <dbl> 0, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 4, 0…
## $ CardPayments            <dbl> 2, 0, 5, 3, 4, 1, 0, 24, 5, 6, 6, 3, 4, 5, 0, …
## $ BitcoinPayments         <dbl> 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 3…

4.6 Merging each order state column

OrdersFulfilled <- col_make('orderState', 'fulfilled')
OrdersFailed <- col_make('orderState', 'failed')
OrdersPending <- col_make('orderState', 'pending')

dataset$OrdersFulfilled <- OrdersFulfilled
dataset$OrdersFailed <- OrdersFailed
dataset$OrdersPending <- OrdersPending

4.7 Merging payment methods by provider

JCB_16 <- col_make('paymentMethodProvider', 'JCB 16 digit')
AmericanExp <- col_make('paymentMethodProvider', 'American Express')
VISA_16 <- col_make('paymentMethodProvider', 'VISA 16 digit')
Discover <- col_make('paymentMethodProvider', 'Discover')
Voyager <- col_make('paymentMethodProvider', 'Voyager')
VISA_13 <- col_make('paymentMethodProvider', 'VISA 13 digit')
Maestro <- col_make('paymentMethodProvider', 'Maestro')
Mastercard <- col_make('paymentMethodProvider', 'Mastercard')
DC_CB <- col_make('paymentMethodProvider', 'Diners Club / Carte Blanche')
JCB_15 <- col_make('paymentMethodProvider', 'JCB 15 digit')

# Add the new payment provider columns to the final data frame
dataset$JCB_16 <- JCB_16
dataset$AmericanExp <- AmericanExp
dataset$VISA_16 <- VISA_16
dataset$Discover <- Discover
dataset$Voyager <- Voyager
dataset$VISA_13 <- VISA_13
dataset$Maestro <- Maestro
dataset$Mastercard <- Mastercard
dataset$DC_CB <- DC_CB
dataset$JCB_15 <- JCB_15
glimpse(dataset)

## Rows: 143
## Columns: 27
## $ customerEmail           <chr> "josephhoward@yahoo.com", "evansjeffery@yahoo.…
## $ customerIPAddress       <chr> "8.129.104.40", "219.173.211.202", "67b7:3db8:…
## $ customerBillingAddress  <chr> "5493 Jones Islands\nBrownside, CA 51896", "35…
## $ No_Transactions         <int> 2, 3, 5, 3, 7, 1, 2, 6, 5, 6, 7, 4, 4, 5, 4, 6…
## $ No_Orders               <int> 2, 3, 3, 3, 7, 1, 1, 5, 2, 5, 5, 4, 2, 5, 4, 3…
## $ No_Payments             <int> 1, 7, 2, 1, 6, 2, 2, 2, 1, 1, 4, 3, 1, 2, 1, 2…
## $ Fraud                   <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0…
## $ Total_transaction_amt   <dbl> 44, 111, 131, 85, 411, 25, 80, 1033, 109, 167,…
## $ No_transactionfail      <dbl> 0, 1, 2, 0, 2, 0, 1, 6, 3, 2, 3, 0, 2, 0, 0, 3…
## $ Payment_Method_Reg_Fail <dbl> 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PaypalPayments          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ApplePayments           <dbl> 0, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 4, 0…
## $ CardPayments            <dbl> 2, 0, 5, 3, 4, 1, 0, 24, 5, 6, 6, 3, 4, 5, 0, …
## $ BitcoinPayments         <dbl> 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0, 3…
## $ OrdersFulfilled         <dbl> 1, 3, 4, 3, 6, 0, 2, 25, 1, 4, 7, 3, 4, 3, 4, …
## $ OrdersFailed            <dbl> 0, 0, 1, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 2, 0, 0…
## $ OrdersPending           <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 4, 1, 0, 1, 0, 0, 0, 0…
## $ JCB_16                  <dbl> 2, 0, 4, 0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 6…
## $ AmericanExp             <dbl> 0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0…
## $ VISA_16                 <dbl> 0, 1, 0, 0, 2, 1, 0, 10, 0, 0, 0, 4, 0, 5, 4, …
## $ Discover                <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 5, 6, 2, 0, 0, 0, 0, 0…
## $ Voyager                 <dbl> 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0…
## $ VISA_13                 <dbl> 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0…
## $ Maestro                 <dbl> 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Mastercard              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0…
## $ DC_CB                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JCB_15                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

4.8 Correlation Heatmap

numeric_columns <- dataset[, sapply(dataset, is.numeric)]
numeric_columns <- numeric_columns[, !colnames(numeric_columns) %in% "X"]
correlation_matrix <- cor(numeric_columns, use = "complete.obs")
fraud_correlation <- correlation_matrix["Fraud", ]
fraud_correlation <- fraud_correlation[!names(fraud_correlation) %in% "Fraud"]  # Exclude self-correlation
sorted_correlation <- sort(fraud_correlation, decreasing = TRUE)
correlation_df <- data.frame(
  Feature = names(sorted_correlation),
  Correlation = sorted_correlation
)

correlation_df$Feature <- factor(correlation_df$Feature, levels = correlation_df$Feature)

# Create the horizontal heatmap
ggplot(correlation_df, aes(x = "Fraud", y = Feature, fill = Correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0, limit = c(-1, 1), space = "Lab",
    name = "Correlation"
  ) +
  geom_text(aes(label = sprintf("%.2f", Correlation)), color = "black") +  # Annotate correlation values
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  ) +
  labs(
    title = "Correlation with Fraud",
    x = "",
    y = "Features"
  )

Total_transaction_amt indicates that higher transaction amounts are strongly associated with fraudulent behavior.
No_Payments shows that more payments per customer are correlated with fraud.
Fraud is slightly more associated with fulfilled orders, though this is less intuitive and may warrant further investigation.
Payment_Method_Reg_Fail and Mastercard features have little to no relationship with fraud, indicating they may not be critical for predictive modeling.
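The values in the heatmap are Pearson correlations between each numeric feature and the 0/1 `Fraud` column. They measure linear association only and do not by themselves show that a feature drives fraud. The sketch below repeats the ranking step on hypothetical toy data (the `toy` data frame and its values are made up) so the mechanics are easy to follow.

```r
# Toy data: `amount` tracks the label closely, `noise` does not (hypothetical values)
toy <- data.frame(
  Fraud = c(0, 0, 1, 1, 0, 1),
  amount = c(10, 20, 80, 90, 15, 70),
  noise = c(5, 1, 4, 2, 3, 3)
)

# Correlation of every column with Fraud, excluding Fraud itself
cors <- cor(toy)["Fraud", ]
cors <- sort(cors[names(cors) != "Fraud"], decreasing = TRUE)

print(round(cors, 2))
```

With only a few rows, as here, correlations are unstable, so small differences between features should not be over-interpreted.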

5. Data Modelling

#load library for all model
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)
library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:purrr':
## 
##     cross

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(rpart)
library(rpart.plot)

5.1 How can we classify transactions as fraudulent or legitimate based on transaction attributes, customer behavior, and payment methods?

To answer the classification use case, we will use SVM and Decision Tree models.

5.1.1 SVM Model

modelset <- dataset[, names(dataset) %in% c("No_Transactions", "No_Orders", "No_Payments", "Total_transaction_amt", "No_transactionfail")]
Y <- dataset[, "Fraud"]

#Normalize the data using caret package with min-max method
preProc <- preProcess(modelset, method = "range")

modelset_scaled <- predict(preProc, modelset)

#add target variable data again
modelset_scaled$Target <- Y
modelset_scaled$Target <- factor(modelset_scaled$Target)

#splitting dataset into training and test set
set.seed(123)

trainIndex <- createDataPartition(Y, p = 0.7, list=F)

trainSet <- modelset_scaled[trainIndex, ]
testSet <-  modelset_scaled[-trainIndex, ]

#create accuracy function
accuracy <- function(matrix){
  a <- (matrix[1,1] + matrix[2,2])/(matrix[1,1] + matrix[2,2] + matrix[1,2] + matrix[2,1])
  return(a)
}

svm_model <- svm(Target ~ ., data = trainSet, kernel = "radial", cost = 1, gamma = 0.01)

summary(svm_model)

## 
## Call:
## svm(formula = Target ~ ., data = trainSet, kernel = "radial", cost = 1, 
##     gamma = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  79
## 
##  ( 39 40 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
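The min-max normalisation applied with `preProcess(method = "range")` before the split rescales each feature to the [0, 1] interval using x' = (x - min) / (max - min). In this report the minimum and maximum are taken from the full dataset. A common alternative, sketched below on hypothetical toy vectors, takes them from the training data only and reuses them for the test data, so that no information from the test set enters the preprocessing. The helper `min_max()` is an illustrative name, not a caret function.

```r
# Rescale x using the given bounds (defaults: the bounds of x itself)
min_max <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

train <- c(2, 4, 10)  # hypothetical training values
test <- c(6, 12)      # hypothetical test values

train_scaled <- min_max(train)
# Reuse the training bounds; test values outside them fall outside [0, 1]
test_scaled <- min_max(test, lo = min(train), hi = max(train))

print(train_scaled)
print(test_scaled)
```

With caret, the same idea would amount to calling `preProcess()` on the training rows and then `predict()` on both subsets.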

#creating evaluation function to make the evaluation easier
evaluate_model <- function(model, testdata){
  print(model)
  predicted_labels <- predict(model, newdata = testdata) #predict on new data
  #confusion matrix: rows are predicted labels, columns are actual labels
  confusion_matrix <- table(Predicted = predicted_labels, Actual = testdata$Target)
  #both ratios use cell [1,1] (predicted 0, actual 0), so they describe class 0 (non-fraud)
  #rec divides by the first row total, i.e. all cases predicted as 0
  rec <- confusion_matrix[1,1]/(confusion_matrix[1,1] + confusion_matrix[1,2])
  #prec divides by the first column total, i.e. all cases that are actually 0
  prec <- confusion_matrix[1,1]/(confusion_matrix[1,1] + confusion_matrix[2,1])
  f1_score <- 2 * (prec * rec) / (prec + rec)
  print("Confusion Matrix")
  print(confusion_matrix)
  print(paste("The accuracy is", accuracy(confusion_matrix)))
  print(paste("The recall is", rec))
  print(paste("The precision is", prec))
  print(paste("The F1-score is", f1_score))
}
#Evaluating the model
evaluate_model(svm_model,testSet)

## 
## Call:
## svm(formula = Target ~ ., data = trainSet, kernel = "radial", cost = 1, 
##     gamma = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  79
## 
## [1] "Confusion Matrix"
##          Actual
## Predicted  0  1
##         0 25  9
##         1  1  7
## [1] "The accuracy is 0.761904761904762"
## [1] "The recall is 0.735294117647059"
## [1] "The precision is 0.961538461538462"
## [1] "The F1-score is 0.833333333333333"
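
With `table(Predicted = ..., Actual = ...)` the rows are predictions and the columns are true labels, so the figures printed above are computed for class 0: the value labelled recall is 25/34 (the share of cases predicted as 0 that are truly 0) and the value labelled precision is 25/26 (the share of true 0 cases that were predicted as 0). For fraud detection, the metrics for class 1 are often of more interest. The following is a minimal sketch, assuming the same table layout; the helper name `class_metrics` is hypothetical.

```r
# Precision, recall and F1 for one class, given a confusion matrix with
# predicted labels in rows and actual labels in columns
class_metrics <- function(cm, positive = "1") {
  tp <- cm[positive, positive]
  fp <- sum(cm[positive, ]) - tp   # predicted positive, actually negative
  fn <- sum(cm[, positive]) - tp   # actually positive, predicted negative
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Confusion matrix of the first SVM model, copied from the output above
# (matrix() fills column by column)
cm_svm1 <- matrix(c(25, 1, 9, 7), nrow = 2,
                  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

fraud_metrics <- class_metrics(cm_svm1, positive = "1")
fraud_metrics
```

For this matrix the fraud-class precision is 7/8 = 0.875 and the fraud-class recall is 7/16 = 0.4375, so the first SVM misses more than half of the fraudulent cases even though its accuracy is 76.2%.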

Next, we refit the SVM model with different cost and gamma values.

#Using different parameters in the SVM model of cost and gamma values
svm_model2 <- svm(Target ~ ., data = trainSet, kernel = "radial", cost = 5, gamma = 0.1)

summary(svm_model2)

## 
## Call:
## svm(formula = Target ~ ., data = trainSet, kernel = "radial", cost = 5, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
## 
## Number of Support Vectors:  62
## 
##  ( 32 30 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

evaluate_model(svm_model2, testSet)

## 
## Call:
## svm(formula = Target ~ ., data = trainSet, kernel = "radial", cost = 5, 
##     gamma = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
## 
## Number of Support Vectors:  62
## 
## [1] "Confusion Matrix"
##          Actual
## Predicted  0  1
##         0 24  4
##         1  2 12
## [1] "The accuracy is 0.857142857142857"
## [1] "The recall is 0.857142857142857"
## [1] "The precision is 0.923076923076923"
## [1] "The F1-score is 0.888888888888889"

#Test accuracy of 85.7% for the second SVM model

#Other SVM model with more customization from caret library
#this uses 10-fold cross validation to pick out likely best sigma and C values

train_control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Train SVM model
svm_model3 <- train(
  Target ~ ., 
  data = trainSet, 
  method = "svmRadial", 
  trControl = train_control, 
  tuneGrid = expand.grid(.C = c(0.1, 1, 10, 50), .sigma = c(0.01, 0.1, 1))
)

# View the best model parameters
print(svm_model3$bestTune)

##    sigma  C
## 11   0.1 50

#best sigma and C it found are 0.1 and 50 respectively


# View model performance
print(svm_model3)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 90, 91, 91, 91, ... 
## Resampling results across tuning parameters:
## 
##   C     sigma  Accuracy   Kappa    
##    0.1  0.01   0.6036364  0.0000000
##    0.1  0.10   0.6036364  0.0000000
##    0.1  1.00   0.6036364  0.0000000
##    1.0  0.01   0.6927273  0.2531639
##    1.0  0.10   0.7409091  0.3900678
##    1.0  1.00   0.7209091  0.3673029
##   10.0  0.01   0.7409091  0.3900678
##   10.0  0.10   0.7809091  0.4916112
##   10.0  1.00   0.6818182  0.3187291
##   50.0  0.01   0.7409091  0.3900678
##   50.0  0.10   0.7909091  0.5173029
##   50.0  1.00   0.6718182  0.2954150
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 50.

# Evaluate performance
evaluate_model(svm_model3, testSet)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 90, 91, 91, 91, ... 
## Resampling results across tuning parameters:
## 
##   C     sigma  Accuracy   Kappa    
##    0.1  0.01   0.6036364  0.0000000
##    0.1  0.10   0.6036364  0.0000000
##    0.1  1.00   0.6036364  0.0000000
##    1.0  0.01   0.6927273  0.2531639
##    1.0  0.10   0.7409091  0.3900678
##    1.0  1.00   0.7209091  0.3673029
##   10.0  0.01   0.7409091  0.3900678
##   10.0  0.10   0.7809091  0.4916112
##   10.0  1.00   0.6818182  0.3187291
##   50.0  0.01   0.7409091  0.3900678
##   50.0  0.10   0.7909091  0.5173029
##   50.0  1.00   0.6718182  0.2954150
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1 and C = 50.
## [1] "Confusion Matrix"
##          Actual
## Predicted  0  1
##         0 24  4
##         1  2 12
## [1] "The accuracy is 0.857142857142857"
## [1] "The recall is 0.857142857142857"
## [1] "The precision is 0.923076923076923"
## [1] "The F1-score is 0.888888888888889"

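Here, we tried two SVM classification models with manually chosen parameters, then a third tuned with 10-fold cross-validation. The tuned model (sigma = 0.1, C = 50) and the second manual model (cost = 5, gamma = 0.1) both reached 85.7% accuracy on the test set, ahead of the first model's 76.2%.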
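
The tuning grid passed to `train()` contains every combination of four C values and three sigma values, i.e. 12 candidates, and the caret output states that the setting with the largest cross-validated accuracy is kept. The sketch below rebuilds that selection step from the accuracy column printed above. It does not refit any model, and the data frame name `cv_results` is an assumption for illustration.

```r
# All combinations of the candidate values used in the tuning grid
grid <- expand.grid(C = c(0.1, 1, 10, 50), sigma = c(0.01, 0.1, 1))
nrow(grid)  # number of candidate settings

# Cross-validated accuracies copied from the caret output, in the printed
# order (C varies slowest, sigma fastest)
cv_results <- data.frame(
  C        = rep(c(0.1, 1, 10, 50), each = 3),
  sigma    = rep(c(0.01, 0.1, 1), times = 4),
  Accuracy = c(0.6036364, 0.6036364, 0.6036364,
               0.6927273, 0.7409091, 0.7209091,
               0.7409091, 0.7809091, 0.6818182,
               0.7409091, 0.7909091, 0.6718182)
)

# Pick the row with the largest accuracy
best <- cv_results[which.max(cv_results$Accuracy), ]
best
```

The selected row corresponds to C = 50 and sigma = 0.1, matching `svm_model3$bestTune`.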

5.1.2 Decision Tree Model

# Set up cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Define hyperparameter grid
tune_grid <- expand.grid(maxdepth = seq(2, 10, by = 2))

# Train the decision tree model
tree_model <- train(
  Target ~ .,
  data = trainSet,
  method = "rpart2",  # Use "rpart2" for explicit control over maxdepth
  trControl = train_control,
  tuneGrid = tune_grid
)


# View results
print(tree_model)

## CART 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 91, 91, 90, 91, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7536364  0.4214568
##    4        0.6636364  0.2693090
##    6        0.6636364  0.2693090
##    8        0.6636364  0.2693090
##   10        0.6636364  0.2693090
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.

# Show variable importance
varImp(tree_model)

## rpart2 variable importance
## 
##                       Overall
## Total_transaction_amt  100.00
## No_Payments             81.47
## No_Orders               37.31
## No_transactionfail      17.00
## No_Transactions          0.00

# Evaluate on the test set
evaluate_model(tree_model,testSet)

## CART 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 91, 91, 90, 91, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7536364  0.4214568
##    4        0.6636364  0.2693090
##    6        0.6636364  0.2693090
##    8        0.6636364  0.2693090
##   10        0.6636364  0.2693090
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.
## [1] "Confusion Matrix"
##          Actual
## Predicted  0  1
##         0 23  2
##         1  3 14
## [1] "The accuracy is 0.880952380952381"
## [1] "The recall is 0.92"
## [1] "The precision is 0.884615384615385"
## [1] "The F1-score is 0.901960784313726"

# Accuracy of 88.1%

# Visualise the final tree
rpart.plot(tree_model$finalModel, type = 2, extra = 104)

# Trying out same model with LOOCV instead of K fold cross validation
train_control2 <- trainControl(method = "LOOCV")

tree_model_LOOCV <- train(
  Target ~ .,
  data = trainSet,
  method = "rpart2",  # Use "rpart2" for explicit control over maxdepth
  trControl = train_control2,
  tuneGrid = tune_grid
)

# View results
print(tree_model_LOOCV)

## CART 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7524752  0.4417422
##    4        0.6633663  0.2712224
##    6        0.6633663  0.2712224
##    8        0.6633663  0.2712224
##   10        0.6633663  0.2712224
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.

# Show variable importance
varImp(tree_model_LOOCV)

## rpart2 variable importance
## 
##                       Overall
## Total_transaction_amt  100.00
## No_Payments             81.47
## No_Orders               37.31
## No_transactionfail      17.00
## No_Transactions          0.00

# Evaluation
evaluate_model(tree_model_LOOCV, testSet)

## CART 
## 
## 101 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    2        0.7524752  0.4417422
##    4        0.6633663  0.2712224
##    6        0.6633663  0.2712224
##    8        0.6633663  0.2712224
##   10        0.6633663  0.2712224
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 2.
## [1] "Confusion Matrix"
##          Actual
## Predicted  0  1
##         0 23  2
##         1  3 14
## [1] "The accuracy is 0.880952380952381"
## [1] "The recall is 0.92"
## [1] "The precision is 0.884615384615385"
## [1] "The F1-score is 0.901960784313726"

SVM classification: the best results were 85.7% accuracy and 88.9% F1-score on the test set, obtained with 10-fold cross-validation and automatic hyperparameter tuning over a 4x3 grid of candidate values. The selected parameters are C = 50 and sigma = 0.1.
Decision Tree: the best results were 88.1% accuracy and 90.2% F1-score on the test set, the best of the models we tried. This was again achieved using 10-fold cross-validation, with the maxdepth hyperparameter selected automatically from the values 2, 4, 6, 8 and 10; leave-one-out cross-validation selected the same depth and gave identical test results. This is somewhat surprising, since a depth-2 tree is a very simple model; its simplicity may lead to better generalisation and less overfitting on a dataset this small.
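
Since accuracy alone can hide how many fraudulent cases are missed, it can help to compare the models on fraud-class recall as well. The following small sketch uses only the test-set confusion matrices reported above (rows are predictions, columns are actual labels); the object and helper names are assumptions for illustration.

```r
# Helper to build a 2x2 confusion matrix: rows = predicted, columns = actual
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
}

# Test-set confusion matrices copied from the outputs above
cms <- list(
  svm_first     = make_cm(c(25, 9, 1, 7)),
  svm_tuned     = make_cm(c(24, 4, 2, 12)),
  decision_tree = make_cm(c(23, 2, 3, 14))
)

# Accuracy and fraud-class (label "1") recall for each model
comparison <- t(sapply(cms, function(cm) {
  c(accuracy     = sum(diag(cm)) / sum(cm),
    fraud_recall = cm["1", "1"] / sum(cm[, "1"]))
}))
round(comparison, 3)
```

On these figures the decision tree flags 14 of the 16 fraudulent test cases, the tuned SVM 12 of 16, and the first SVM 7 of 16.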

5.2 Can we predict the total transaction amount for a customer based on their past behavior and transaction patterns?

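For the regression task, we fit an ordinary linear regression with lm() on the binary Fraud indicator, i.e. a linear probability model rather than a logistic regression in the strict sense (which would use glm() with family = binomial). Note that the response in the code below is Fraud, with Total_transaction_amt as one of the predictors.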

5.2.1 Logistic Regression

# Include additional variables in the regression dataset
regression_data <- dataset[, c("Fraud", "Total_transaction_amt", "No_transactionfail", 
                               "PaypalPayments", "CardPayments", "BitcoinPayments")]

# Ensure the Fraud variable is numeric
regression_data$Fraud <- as.numeric(regression_data$Fraud)

# Split the data into training and testing sets
set.seed(123)  
trainIndex <- createDataPartition(regression_data$Fraud, p = 0.7, list = FALSE)
trainSet <- regression_data[trainIndex, ]
testSet <- regression_data[-trainIndex, ]

# Train a regression model with additional predictors
regression_model <- lm(Fraud ~ Total_transaction_amt + No_transactionfail + 
                         PaypalPayments + CardPayments + BitcoinPayments, 
                       data = trainSet)

# View model summary
summary(regression_model)

## 
## Call:
## lm(formula = Fraud ~ Total_transaction_amt + No_transactionfail + 
##     PaypalPayments + CardPayments + BitcoinPayments, data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6025 -0.3418 -0.1481  0.3449  1.0430 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.3034687  0.0595963   5.092 1.79e-06 ***
## Total_transaction_amt  0.0022819  0.0006715   3.398 0.000993 ***
## No_transactionfail    -0.1116990  0.0496174  -2.251 0.026677 *  
## PaypalPayments        -0.1221321  0.0520226  -2.348 0.020966 *  
## CardPayments          -0.0332625  0.0292411  -1.138 0.258178    
## BitcoinPayments       -0.0106110  0.0518889  -0.204 0.838404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4339 on 95 degrees of freedom
## Multiple R-squared:  0.2598, Adjusted R-squared:  0.2208 
## F-statistic: 6.669 on 5 and 95 DF,  p-value: 2.301e-05

# Make predictions on the test set
predictions <- predict(regression_model, newdata = testSet)

# Calculate Mean Squared Error (MSE) to evaluate the model
mse <- mean((testSet$Fraud - predictions)^2)
cat("Mean Squared Error (MSE):", mse, "\n")

## Mean Squared Error (MSE): 0.1724392

The model's test-set MSE is 0.1724, the average squared difference between the actual 0/1 Fraud label and the fitted value, which is read here as a fraud probability.
Intercept: 0.3034687
The coefficients indicate each predictor's direction of influence on the fraud prediction: Total_transaction_amt has a positive and statistically significant coefficient (p < 0.001), No_transactionfail and PaypalPayments have negative coefficients that are significant at the 5% level, and CardPayments and BitcoinPayments are not significant.
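
The model above is an ordinary least-squares fit on a 0/1 response, so its fitted values are not constrained to lie between 0 and 1; the maximum residual of 1.0430 in the summary implies at least one fitted value below zero. A logistic regression keeps predictions inside (0, 1). The following is a minimal sketch on simulated data, assuming a single predictor. The variable names and simulated coefficients are illustrative, and nothing it produces should be read as a result for the fraud dataset.

```r
set.seed(123)

# Simulated stand-in data: a binary outcome whose log-odds rise with the amount
n <- 200
sim <- data.frame(Total_transaction_amt = runif(n, 0, 300))
sim$Fraud <- rbinom(n, size = 1,
                    prob = plogis(-2 + 0.015 * sim$Total_transaction_amt))

# Logistic regression: glm() with a binomial family (logit link by default)
logit_model <- glm(Fraud ~ Total_transaction_amt, data = sim, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
probs <- predict(logit_model, newdata = sim, type = "response")
range(probs)

# Convert probabilities to class labels with a 0.5 threshold (an arbitrary choice)
pred_class <- as.integer(probs > 0.5)
table(Predicted = pred_class, Actual = sim$Fraud)
```

Applied to the real data, the same call would take the Fraud column as the response and the predictors used in the lm() formula; the threshold would then be chosen according to how costly missed fraud is relative to false alarms.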

# Plot the relationship between Total_transaction_amt and Fraud
ggplot(trainSet, aes(x = Total_transaction_amt, y = Fraud)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Regression: Fraud vs Total Transaction Amount",
       x = "Total Transaction Amount",
       y = "Fraud (Probability)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

- Total Transaction Amount is a significant predictor of fraud probability. Customers with higher transaction amounts are more likely to be involved in fraudulent activities, as indicated by the positive correlation.
- While there is a clear trend, the data points for low transaction amounts show variability in fraud probability. This means that not all low-value transactions are safe, and not all high-value transactions are fraudulent.

6. Summary

To summarize the classification and regression problems posed above: for classification, we used SVM and Decision Tree models to identify fraudulent activity in e-commerce from transaction attributes, customer behaviour and payment methods. The best-performing model was the Decision Tree tuned with 10-fold cross-validation, which gives 88.1% accuracy and 90.2% F1-score on the test set. It also provides variable importance scores, which help interpretability. The most important variables in the Decision Tree model are, in decreasing order, Total_transaction_amt, No_Payments, No_Orders and No_transactionfail.

The second question asked whether the total transaction amount can be predicted from a customer's past behaviour and transaction patterns. In practice, we fit a linear regression (lm) with the Fraud indicator as the response and the total transaction amount, failed transactions and payment-method counts as predictors, rather than a logistic regression. The model was evaluated with the Mean Squared Error (MSE), which was 0.1724 on the test set. Performance is moderate: the model explains about 26% of the variance in the training data (R-squared of 0.2598). The model is easy to implement and interpret, and it allowed us to visualize fraud probability against total transaction amount. There is a clear upward trend between the two, but it should not be taken at face value, since low-value transactions are not always safe and high-value transactions are not always fraudulent.

E-commerce Fraud Prevention

Group 2 | Nissa | Veelvili | Syawal | Yang | Alessandro

2025-01-08
