Credit Card Transactions Fraud Detection

Project Background

Credit card fraud detection has emerged as a critical challenge in modern financial transactions due to the increasing prevalence of online transactions and digital payment methods. The unauthorized use of credit cards for fraudulent activities poses substantial financial risks to individuals, businesses, and financial institutions. To combat this issue, sophisticated fraud detection systems are required to swiftly identify and prevent fraudulent transactions, ensuring the security and trustworthiness of financial operations.

This project aims to develop a credit card fraud detection system using machine learning algorithms and advanced analytics techniques. The system will analyze transaction data in real-time to identify potentially fraudulent activities and alert users or financial institutions to take appropriate actions.

This project is suitable for financial institutions such as banks, credit card companies, and payment processors that want to enhance their fraud detection capabilities and protect customers from fraudulent transactions.

The primary users of the credit card fraud detection system include fraud analysts, risk managers, and security teams within financial organizations. Additionally, customers who use credit cards for online or in-person transactions will indirectly benefit from the increased security provided by the system.

By implementing an effective fraud detection system, organizations can reduce financial losses due to fraudulent transactions, minimize reputational damage, and enhance customer trust and loyalty. Additionally, the system can help improve regulatory compliance and mitigate legal risks associated with fraud-related incidents.

Research Question

Project Objectives

Dataset

Variable Table

Variable Descrtiption Data type
X Unique identifier integer
trans_date_trans_time Transaction date and time datetime
cc_num Credit card number of customers integer
merchant Name of merchant categorical
category categorical
amt Amount float
first first name categorical
last last name categorical
gender categorical
street street name categorical
city city name categorical
state state name categorical
zip Zip code categorical
lat Latitude float
long Longitude float
city_pop City population integer
job Job title categorical
trans_num categorical
unix_time datetime
merch_lat Merchant latitude float
merch_long Merchant longitude float
is_fraud Target variable: fraud status (0=Not fraud, 1=Fraud) categorical


- New variable(s) created

Variable Descrtiption Data type
age Age integer


- Variables to be removed

Variable Descrtiption Data type Reason
dob Date of birth datetime New ‘age’ column created from ‘dob’ column

Part 0: Load data

# import necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(caret)
## Loading required package: lattice
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# quite slow to load because the files are too big # take around 5 minutes, once loaded, you don't need to re-run this code block
fraudTrain_df_raw <- read.csv("./fraudTrain.csv")
fraudTest_df_raw <- read.csv("./fraudTest.csv")

fraudTrain_df <- data.frame(fraudTrain_df_raw)
fraudTest_df <- data.frame(fraudTest_df_raw)

# combine the train and test data
combined_fraud_df_raw <- bind_rows(fraudTrain_df, fraudTest_df)
head(combined_fraud_df_raw, 10)
##    X trans_date_trans_time       cc_num                           merchant
## 1  0   2019-01-01 00:00:18 2.703186e+15         fraud_Rippin, Kub and Mann
## 2  1   2019-01-01 00:00:44 6.304233e+11    fraud_Heller, Gutmann and Zieme
## 3  2   2019-01-01 00:00:51 3.885949e+13               fraud_Lind-Buckridge
## 4  3   2019-01-01 00:01:16 3.534094e+15 fraud_Kutch, Hermiston and Farrell
## 5  4   2019-01-01 00:03:06 3.755342e+14                fraud_Keeling-Crist
## 6  5   2019-01-01 00:04:08 4.767265e+15   fraud_Stroman, Hudson and Erdman
## 7  6   2019-01-01 00:04:42 3.007469e+13              fraud_Rowe-Vandervort
## 8  7   2019-01-01 00:05:08 6.011361e+15               fraud_Corwin-Collins
## 9  8   2019-01-01 00:05:18 4.922711e+15                   fraud_Herzog Ltd
## 10 9   2019-01-01 00:06:01 2.720830e+15  fraud_Schoen, Kuphal and Nitzsche
##         category    amt     first     last gender
## 1       misc_net   4.97  Jennifer    Banks      F
## 2    grocery_pos 107.23 Stephanie     Gill      F
## 3  entertainment 220.11    Edward  Sanchez      M
## 4  gas_transport  45.00    Jeremy    White      M
## 5       misc_pos  41.96     Tyler   Garcia      M
## 6  gas_transport  94.63  Jennifer   Conner      F
## 7    grocery_net  44.54    Kelsey Richards      F
## 8  gas_transport  71.65    Steven Williams      M
## 9       misc_pos   4.27   Heather    Chase      F
## 10   grocery_pos 198.39   Melissa  Aguilar      F
##                            street           city state   zip     lat      long
## 1                  561 Perry Cove Moravian Falls    NC 28654 36.0788  -81.1781
## 2    43039 Riley Greens Suite 393         Orient    WA 99160 48.8878 -118.2105
## 3        594 White Dale Suite 530     Malad City    ID 83252 42.1808 -112.2620
## 4     9443 Cynthia Court Apt. 038        Boulder    MT 59632 46.2306 -112.1138
## 5                408 Bradley Rest       Doe Hill    VA 24433 38.4207  -79.4629
## 6               4655 David Island         Dublin    PA 18917 40.3750  -75.2045
## 7     889 Sarah Station Suite 624        Holcomb    KS 67851 37.9931 -100.9893
## 8       231 Flores Pass Suite 720       Edinburg    VA 22824 38.8432  -78.6003
## 9     6888 Hicks Stream Suite 954          Manor    PA 15665 40.3359  -79.6607
## 10 21326 Taylor Squares Suite 708    Clarksville    TN 37040 36.5220  -87.3490
##    city_pop                               job        dob
## 1      3495         Psychologist, counselling 1988-03-09
## 2       149 Special educational needs teacher 1978-06-21
## 3      4154       Nature conservation officer 1962-01-19
## 4      1939                   Patent attorney 1967-01-12
## 5        99    Dance movement psychotherapist 1986-03-28
## 6      2158                 Transport planner 1961-06-19
## 7      2691                   Arboriculturist 1993-08-16
## 8      6018              Designer, multimedia 1947-08-21
## 9      1472         Public affairs consultant 1941-03-07
## 10   151785                       Pathologist 1974-03-28
##                           trans_num  unix_time merch_lat merch_long is_fraud
## 1  0b242abb623afc578575680df30655b9 1325376018  36.01129  -82.04832        0
## 2  1f76529f8574734946361c461b024d99 1325376044  49.15905 -118.18646        0
## 3  a1a22d70485983eac12b5b88dad1cf95 1325376051  43.15070 -112.15448        0
## 4  6b849c168bdad6f867558c3793159a81 1325376076  47.03433 -112.56107        0
## 5  a41d7549acf90789359a9aa5346dcb46 1325376186  38.67500  -78.63246        0
## 6  189a841a0a8ba03058526bcfe566aab5 1325376248  40.65338  -76.15267        0
## 7  83ec1cc84142af6e2acf10c44949e720 1325376282  37.16270 -100.15337        0
## 8  6d294ed2cc447d2c71c7171a3d54967c 1325376308  38.94809  -78.54030        0
## 9  fc28024ce480f8ef21a32d64c93a29f5 1325376318  40.35181  -79.95815        0
## 10 3b9014ea8fb80bd65de0b1463b00b00e 1325376361  37.17920  -87.48538        0
# save another dataframe copy so that we need not to re-download the file with the script above when we messed up with the dataset
# remove the first column: 'X', which is just the numbering
fraud_df <- combined_fraud_df_raw[-1,]
glimpse(fraud_df)
## Rows: 593,664
## Columns: 23
## $ X                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trans_date_trans_time <chr> "2019-01-01 00:00:44", "2019-01-01 00:00:51", "2…
## $ cc_num                <dbl> 6.304233e+11, 3.885949e+13, 3.534094e+15, 3.7553…
## $ merchant              <chr> "fraud_Heller, Gutmann and Zieme", "fraud_Lind-B…
## $ category              <chr> "grocery_pos", "entertainment", "gas_transport",…
## $ amt                   <dbl> 107.23, 220.11, 45.00, 41.96, 94.63, 44.54, 71.6…
## $ first                 <chr> "Stephanie", "Edward", "Jeremy", "Tyler", "Jenni…
## $ last                  <chr> "Gill", "Sanchez", "White", "Garcia", "Conner", …
## $ gender                <chr> "F", "M", "M", "M", "F", "F", "M", "F", "F", "M"…
## $ street                <chr> "43039 Riley Greens Suite 393", "594 White Dale …
## $ city                  <chr> "Orient", "Malad City", "Boulder", "Doe Hill", "…
## $ state                 <chr> "WA", "ID", "MT", "VA", "PA", "KS", "VA", "PA", …
## $ zip                   <int> 99160, 83252, 59632, 24433, 18917, 67851, 22824,…
## $ lat                   <dbl> 48.8878, 42.1808, 46.2306, 38.4207, 40.3750, 37.…
## $ long                  <dbl> -118.2105, -112.2620, -112.1138, -79.4629, -75.2…
## $ city_pop              <int> 149, 4154, 1939, 99, 2158, 2691, 6018, 1472, 151…
## $ job                   <chr> "Special educational needs teacher", "Nature con…
## $ dob                   <chr> "1978-06-21", "1962-01-19", "1967-01-12", "1986-…
## $ trans_num             <chr> "1f76529f8574734946361c461b024d99", "a1a22d70485…
## $ unix_time             <int> 1325376044, 1325376051, 1325376076, 1325376186, …
## $ merch_lat             <dbl> 49.15905, 43.15070, 47.03433, 38.67500, 40.65338…
## $ merch_long            <dbl> -118.18646, -112.15448, -112.56107, -78.63246, -…
## $ is_fraud              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# some simple summary statistics of all of the variables in our fraud_df data frame
summary(fraud_df)
##        X          trans_date_trans_time     cc_num            merchant        
##  Min.   :     0   Length:593664         Min.   :6.042e+10   Length:593664     
##  1st Qu.:110471   Class :character      1st Qu.:1.800e+14   Class :character  
##  Median :258887   Mode  :character      Median :3.521e+15   Mode  :character  
##  Mean   :261312                         Mean   :4.176e+17                     
##  3rd Qu.:407302                         3rd Qu.:4.635e+15                     
##  Max.   :555718                         Max.   :4.992e+18                     
##                                                                               
##    category              amt              first               last          
##  Length:593664      Min.   :    1.00   Length:593664      Length:593664     
##  Class :character   1st Qu.:    9.63   Class :character   Class :character  
##  Mode  :character   Median :   47.34   Mode  :character   Mode  :character  
##                     Mean   :   69.56                                        
##                     3rd Qu.:   83.06                                        
##                     Max.   :22768.11                                        
##                                                                             
##     gender             street              city              state          
##  Length:593664      Length:593664      Length:593664      Length:593664     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       zip             lat             long            city_pop      
##  Min.   : 1257   Min.   :20.03   Min.   :-165.67   Min.   :     23  
##  1st Qu.:26237   1st Qu.:34.67   1st Qu.: -96.80   1st Qu.:    741  
##  Median :48174   Median :39.37   Median : -87.48   Median :   2408  
##  Mean   :48835   Mean   :38.54   Mean   : -90.23   Mean   :  88300  
##  3rd Qu.:72011   3rd Qu.:41.89   3rd Qu.: -80.18   3rd Qu.:  19685  
##  Max.   :99921   Max.   :65.69   Max.   : -67.95   Max.   :2906700  
##                                                                     
##      job                dob             trans_num           unix_time        
##  Length:593664      Length:593664      Length:593664      Min.   :1.325e+09  
##  Class :character   Class :character   Class :character   1st Qu.:1.375e+09  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.380e+09  
##                                                           Mean   :1.377e+09  
##                                                           3rd Qu.:1.386e+09  
##                                                           Max.   :1.389e+09  
##                                                           NA's   :1          
##    merch_lat       merch_long         is_fraud       
##  Min.   :19.03   Min.   :-166.67   Min.   :0.000000  
##  1st Qu.:34.76   1st Qu.: -96.91   1st Qu.:0.000000  
##  Median :39.37   Median : -87.44   Median :0.000000  
##  Mean   :38.54   Mean   : -90.23   Mean   :0.004245  
##  3rd Qu.:41.95   3rd Qu.: -80.26   3rd Qu.:0.000000  
##  Max.   :66.68   Max.   : -66.95   Max.   :1.000000  
##  NA's   :1       NA's   :1         NA's   :1
# Basic statistics for all the variables
# get total count, unique count and missing value count of each columns
summary_table <- fraud_df %>% reframe(
      name=as.vector(names(fraud_df)),
      count=n(), # total count
      unique=sapply(fraud_df, function (x){length(unique(x))}), # unique count
      missing_value_count=sapply(fraud_df, function(x){sum(is.na(x))}) # check missing values
)
summary_table
##                     name  count unique missing_value_count
## 1                      X 593664 555719                   0
## 2  trans_date_trans_time 593664 582252                   0
## 3                 cc_num 593664    925                   0
## 4               merchant 593664    693                   0
## 5               category 593664     14                   0
## 6                    amt 593664  38297                   0
## 7                  first 593664    341                   0
## 8                   last 593664    471                   0
## 9                 gender 593664      2                   0
## 10                street 593664    925                   0
## 11                  city 593664    849                   0
## 12                 state 593664     50                   0
## 13                   zip 593664    913                   0
## 14                   lat 593664    911                   0
## 15                  long 593664    911                   0
## 16              city_pop 593664    835                   0
## 17                   job 593664    478                   0
## 18                   dob 593664    911                   0
## 19             trans_num 593664 593664                   0
## 20             unix_time 593664 582252                   1
## 21             merch_lat 593664 583142                   1
## 22            merch_long 593664 589156                   1
## 23              is_fraud 593664      3                   1

Part 1: Data pre-processing & Data cleaning

1.1 Check missing values (if any)

# removing null values
if (any(is.na(fraud_df))) {
  no_of_rows_with_missing_values <- sum(is.na(fraud_df))
  fraud_df <- na.omit(fraud_df)
  print(paste0(no_of_rows_with_missing_values,
  "Rows with missing values are removed"))
} else {
  print("There is no missing values")
}
## [1] "4Rows with missing values are removed"

1.2 Check duplicated values (if any)

# check if there is any duplicated rows in the fraud_df dataset
if (sum(duplicated(fraud_df)) > 0) {
  fraud_df <- fraud_df[!duplicated(fraud_df), ]
} else {
  print("No duplicated rows found!")
}
## [1] "No duplicated rows found!"

1.3 Assign correct data type to each variables

# assign correct data type to each variable in the data frame (e.g., factor, vector, list, etc)
fraud_df$cc_num <- as.factor(fraud_df$cc_num)
fraud_df$merchant <- as.factor(fraud_df$merchant)
fraud_df$category<- as.factor(fraud_df$category)
fraud_df$amt <- as.numeric(fraud_df$amt)
fraud_df$first <- as.factor(fraud_df$first)
fraud_df$last <- as.factor(fraud_df$last)
fraud_df$gender <- as.factor(fraud_df$gender)
fraud_df$street <- as.factor(fraud_df$street)
fraud_df$city <- as.factor(fraud_df$city)
fraud_df$state <- as.factor(fraud_df$state)
fraud_df$zip <- as.factor(fraud_df$zip)
fraud_df$lat <- as.numeric(fraud_df$lat)
fraud_df$long <- as.numeric(fraud_df$long)
fraud_df$city_pop <- as.numeric(fraud_df$city_pop)
fraud_df$job <- as.factor(fraud_df$job)
fraud_df$dob <- as.Date(fraud_df$dob)
fraud_df$trans_num <- as.factor(fraud_df$trans_num)
fraud_df$unix_time <- as.Date(as.POSIXct(fraud_df$unix_time, origin="1970-01-01")) # convert unix time to Date object
fraud_df$merch_lat <- as.numeric(fraud_df$merch_lat)
fraud_df$merch_long <- as.numeric(fraud_df$merch_long)
fraud_df$is_fraud <- as.factor(fraud_df$is_fraud)
# check back again the data type of each variables after data type conversion in the previous step
glimpse(fraud_df)
## Rows: 593,663
## Columns: 23
## $ X                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trans_date_trans_time <chr> "2019-01-01 00:00:44", "2019-01-01 00:00:51", "2…
## $ cc_num                <fct> 630423337322, 38859492057661, 3534093764340240, …
## $ merchant              <fct> "fraud_Heller, Gutmann and Zieme", "fraud_Lind-B…
## $ category              <fct> grocery_pos, entertainment, gas_transport, misc_…
## $ amt                   <dbl> 107.23, 220.11, 45.00, 41.96, 94.63, 44.54, 71.6…
## $ first                 <fct> Stephanie, Edward, Jeremy, Tyler, Jennifer, Kels…
## $ last                  <fct> Gill, Sanchez, White, Garcia, Conner, Richards, …
## $ gender                <fct> F, M, M, M, F, F, M, F, F, M, F, M, M, M, M, F, …
## $ street                <fct> 43039 Riley Greens Suite 393, 594 White Dale Sui…
## $ city                  <fct> Orient, Malad City, Boulder, Doe Hill, Dublin, H…
## $ state                 <fct> WA, ID, MT, VA, PA, KS, VA, PA, TN, IA, WV, FL, …
## $ zip                   <fct> 99160, 83252, 59632, 24433, 18917, 67851, 22824,…
## $ lat                   <dbl> 48.8878, 42.1808, 46.2306, 38.4207, 40.3750, 37.…
## $ long                  <dbl> -118.2105, -112.2620, -112.1138, -79.4629, -75.2…
## $ city_pop              <dbl> 149, 4154, 1939, 99, 2158, 2691, 6018, 1472, 151…
## $ job                   <fct> "Special educational needs teacher", "Nature con…
## $ dob                   <date> 1978-06-21, 1962-01-19, 1967-01-12, 1986-03-28,…
## $ trans_num             <fct> 1f76529f8574734946361c461b024d99, a1a22d70485983…
## $ unix_time             <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01,…
## $ merch_lat             <dbl> 49.15905, 43.15070, 47.03433, 38.67500, 40.65338…
## $ merch_long            <dbl> -118.18646, -112.15448, -112.56107, -78.63246, -…
## $ is_fraud              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

1.4 Data transformation - new column ‘age’ created

# create a new column called "age"
reference_date <- as.Date("2019-01-01")
fraud_df$age <- as.numeric(lubridate::interval(fraud_df$dob, reference_date) / years(1))
fraud_df <- subset(fraud_df, select=c(-dob))

# convert the data type of 'age' column to numeric
fraud_df$age <- as.numeric(fraud_df$age)

head(fraud_df,10)
##     X trans_date_trans_time           cc_num                           merchant
## 2   1   2019-01-01 00:00:44     630423337322    fraud_Heller, Gutmann and Zieme
## 3   2   2019-01-01 00:00:51   38859492057661               fraud_Lind-Buckridge
## 4   3   2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell
## 5   4   2019-01-01 00:03:06  375534208663984                fraud_Keeling-Crist
## 6   5   2019-01-01 00:04:08 4767265376804500   fraud_Stroman, Hudson and Erdman
## 7   6   2019-01-01 00:04:42   30074693890476              fraud_Rowe-Vandervort
## 8   7   2019-01-01 00:05:08 6011360759745864               fraud_Corwin-Collins
## 9   8   2019-01-01 00:05:18 4922710831011201                   fraud_Herzog Ltd
## 10  9   2019-01-01 00:06:01 2720830304681674  fraud_Schoen, Kuphal and Nitzsche
## 11 10   2019-01-01 00:06:23    4642894980163             fraud_Rutherford-Mertz
##         category    amt     first     last gender
## 2    grocery_pos 107.23 Stephanie     Gill      F
## 3  entertainment 220.11    Edward  Sanchez      M
## 4  gas_transport  45.00    Jeremy    White      M
## 5       misc_pos  41.96     Tyler   Garcia      M
## 6  gas_transport  94.63  Jennifer   Conner      F
## 7    grocery_net  44.54    Kelsey Richards      F
## 8  gas_transport  71.65    Steven Williams      M
## 9       misc_pos   4.27   Heather    Chase      F
## 10   grocery_pos 198.39   Melissa  Aguilar      F
## 11   grocery_pos  24.74     Eddie   Mendez      M
##                            street        city state   zip     lat      long
## 2    43039 Riley Greens Suite 393      Orient    WA 99160 48.8878 -118.2105
## 3        594 White Dale Suite 530  Malad City    ID 83252 42.1808 -112.2620
## 4     9443 Cynthia Court Apt. 038     Boulder    MT 59632 46.2306 -112.1138
## 5                408 Bradley Rest    Doe Hill    VA 24433 38.4207  -79.4629
## 6               4655 David Island      Dublin    PA 18917 40.3750  -75.2045
## 7     889 Sarah Station Suite 624     Holcomb    KS 67851 37.9931 -100.9893
## 8       231 Flores Pass Suite 720    Edinburg    VA 22824 38.8432  -78.6003
## 9     6888 Hicks Stream Suite 954       Manor    PA 15665 40.3359  -79.6607
## 10 21326 Taylor Squares Suite 708 Clarksville    TN 37040 36.5220  -87.3490
## 11      1831 Faith View Suite 653    Clarinda    IA 51632 40.7491  -95.0380
##    city_pop                               job                        trans_num
## 2       149 Special educational needs teacher 1f76529f8574734946361c461b024d99
## 3      4154       Nature conservation officer a1a22d70485983eac12b5b88dad1cf95
## 4      1939                   Patent attorney 6b849c168bdad6f867558c3793159a81
## 5        99    Dance movement psychotherapist a41d7549acf90789359a9aa5346dcb46
## 6      2158                 Transport planner 189a841a0a8ba03058526bcfe566aab5
## 7      2691                   Arboriculturist 83ec1cc84142af6e2acf10c44949e720
## 8      6018              Designer, multimedia 6d294ed2cc447d2c71c7171a3d54967c
## 9      1472         Public affairs consultant fc28024ce480f8ef21a32d64c93a29f5
## 10   151785                       Pathologist 3b9014ea8fb80bd65de0b1463b00b00e
## 11     7297                        IT trainer d71c95ab6b7356dd74389d41df429c87
##     unix_time merch_lat merch_long is_fraud      age
## 2  2012-01-01  49.15905 -118.18646        0 40.53151
## 3  2012-01-01  43.15070 -112.15448        0 56.95068
## 4  2012-01-01  47.03433 -112.56107        0 51.96986
## 5  2012-01-01  38.67500  -78.63246        0 32.76438
## 6  2012-01-01  40.65338  -76.15267        0 57.53699
## 7  2012-01-01  37.16270 -100.15337        0 25.37808
## 8  2012-01-01  38.94809  -78.54030        0 71.36438
## 9  2012-01-01  40.35181  -79.95815        0 77.82192
## 10 2012-01-01  37.17920  -87.48538        0 44.76438
## 11 2012-01-01  40.27589  -96.01155        0 28.47123

Part 2: Exploratory Data analysis

2.1 Univariate analysis

2.2 Bivariate analysis

2.2.1 Comparing Transaction amounts between Genuine vs Fraud category

genuine_vs_fraud_amt <- fraud_df %>%
  mutate(is_fraud_label = ifelse(is_fraud == 0, "Genuine", "Fraud"))

# Plotting boxplot
boxplot(amt ~ is_fraud_label, data = genuine_vs_fraud_amt,
        main = "Comparing Genuine vs Fraud average amounts")

2.2.2 Percentage of fraud across various categories (%)

fraud_category_count <- fraud_df %>% group_by(is_fraud, category
    ) %>% reframe(count=n()) %>% spread(is_fraud, count)

# rename the column variables
colnames(fraud_category_count) <- c("category", "category_0", "category_1")

fraud_category_count <- fraud_category_count %>% mutate(
      total_count=category_0+category_1) %>% mutate(
        fraud_percent=category_1/total_count*100) %>% arrange(desc(fraud_percent))

ggplot(fraud_category_count, aes(
  x=reorder(category,-fraud_percent), y=fraud_percent)
  ) + geom_bar(stat='identity', width=0.5) + labs(
    title="Percentage of fraud across various categories (%)",
    x="Category",
    y="Percentage of fraud (%)") + geom_text(
      aes(label=round(fraud_percent, 2)), vjust=-0.5
    ) + theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    plot.margin = unit(c(1, 1, 1, 1), "cm"),  # Adjusts the margins of the plot
    legend.position = "none"
  )

2.2.3 Top 10 cities that are most often affected by fraud cases

# Who are the top 10 cities that are most often affected by fraud cases

city_fraud_case_count <- fraud_df[which(fraud_df$is_fraud == "1"),
        ] %>% group_by(city) %>% summarise(count=n()) %>% arrange(desc(count))

ggplot(head(city_fraud_case_count,10), aes(x=reorder(city, -count), y=count)
      ) + geom_bar(stat='identity'
      ) + geom_text(aes(label=count), vjust=-0.5
      ) + theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    plot.margin = unit(c(1, 1, 1, 1), "cm"),  # Adjusts the margins of the plot
    legend.position = "none"
  ) + labs(
    title = "Top 10 cities most often affected by fraud cases",
    x = "City name",
    y = "Count"
  )

2.2.4 Top 10 merchants most often affected by fraud cases

# Who are the top 10 merchants most often affected by fraud cases?

merchant_fraud_case_count <- fraud_df[which(fraud_df$is_fraud == "1"),
        ] %>% group_by(merchant) %>% summarise(count=n()) %>% arrange(desc(count))

ggplot(head(merchant_fraud_case_count,10), aes(x=reorder(merchant, -count), y=count)
      ) + geom_bar(stat='identity'
      ) + geom_text(aes(label=count), vjust=-0.5
      ) + theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    plot.margin = unit(c(1, 1, 1, 1), "cm"),  # Adjusts the margins of the plot
    legend.position = "none"
  ) + labs(
    title = "Top 10 merchants most often affected by fraud cases",
    x = "Merchant name",
    y = "Count"
  )

2.2.5 Age Distribution by Fraud Status

adjusted_fraud_df <- fraud_df %>% mutate(is_fraud_label =
  ifelse(is_fraud == 0, "Genuine", "Fraud"))

ggplot(adjusted_fraud_df, aes(x = is_fraud_label,
  y = age, fill = is_fraud_label)) +
    geom_boxplot() +
    labs(
      title = "How does age differ by Fraud Status?",
      x = "Fraud status",
      y = "Age"
    ) +
    theme_minimal() +
    theme(
      strip.background = element_blank(),
      strip.text = element_text(size = 12, face = "bold"),
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
    ) + scale_fill_manual(values = c("red", "cyan"))

Statistics check 1: Are those who commit in fraud are generally older?

Independent t-test

  • Null Hypothesis (H0): The mean age of individuals in fraud cases is less than or equal to the mean age of individuals in non-fraud cases.
  • Alternative Hypothesis (H1): The mean age of individuals in fraud cases is greater than the mean age of individuals in non-fraud cases.

\[H_0: \mu_{\text{age_fraud}} =< \mu_{\text{age_non-fraud}}\] \[H_1: \mu_{\text{age_fraud}} > \mu_{\text{age_non-fraud}}\]

genuine_group_with_age <- fraud_df %>% filter(is_fraud==0) %>% select(age)
fraud_group_with_age <- fraud_df %>% filter(is_fraud==1) %>% select(age)

# independent t-test
res <- t.test(fraud_group_with_age, genuine_group_with_age,
     alternative='greater', conf.level=0.95)
res
## 
##  Welch Two Sample t-test
## 
## data:  fraud_group_with_age and genuine_group_with_age
## t = 6.2496, df = 2539.1, p-value = 2.406e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  1.655274      Inf
## sample estimates:
## mean of x mean of y 
##  47.38756  45.14071
if (res$p.value > 0.05){
  print("The mean age of individuals in fraud cases is less than or equal to the mean age of individuals in non-fraud cases.")
} else {
genuine_group_with_age <- fraud_df %>% filter(is_fraud==0) %>% select(age)
  print("The mean age of individuals in fraud cases is greater than the mean age of individuals in non-fraud cases.")
}
## [1] "The mean age of individuals in fraud cases is greater than the mean age of individuals in non-fraud cases."

Statistics check 2: Is age independent of fraud status?

Chi-Square Goodness of Test of Independence

Null hypothesis (H0): There is no association between age group and fraud status. In other words, the distribution of age groups is the same for fraudulent and genuine cases.

Alternative hypothesis (H1): There is an association between age group and fraud status. In other words, the distribution of age groups differs between fraudulent and genuine cases.

# Load required library
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
age_group <- c("0-30", "30-60", "60+")
adjusted_fraud_df <- fraud_df %>% mutate(is_fraud_label = ifelse(is_fraud == 0, "Genuine", "Fraud"))
adjusted_fraud_df$age_group <- cut(adjusted_fraud_df$age,
                          breaks=c(0, 30, 60, Inf),
                          labels=age_group,
                          right=TRUE
)

matrix_data <-adjusted_fraud_df %>% group_by(age_group, is_fraud_label
            ) %>% reframe(count=n()) %>% spread(is_fraud_label, count)

matrix_data <- as.table(rbind(matrix_data$Fraud, matrix_data$Genuine))

# Perform the Chi-Square Test
res <- chisq.test(matrix_data)
res
## 
##  Pearson's Chi-squared test
## 
## data:  matrix_data
## X-squared = 34.876, df = 2, p-value = 2.671e-08
if (res$p.value < 0.05){
  print("CONCLUSION: There is a statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups differs between fraudulent and genuine cases.")
} else {
  print("CONCLUSION: There is no statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups is similar for fraudulent and genuine cases.")
}
## [1] "CONCLUSION: There is a statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups differs between fraudulent and genuine cases."

2.3 Multivariate analysis

# Correlation analysis: what features are correlated?
numerical_vars <- subset(fraud_df, select = c(amt, lat, long, city_pop, merch_lat, merch_long, age))
relation <- cor(numerical_vars, method="pearson")
corrplot(relation, method = 'number')

# Ans: merch_lat is strongly correlated with lat, and merch_long is strongly correlated with long
#      we might remove any either of the correlated variable in our statistics and machine learning models

Part 3: Preparing data for Modelling

# Load necessary libraries for modelling
library(tibble)
library(dplyr)
library(caret)
library(stringr)
library(tidyr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ pastecs::extract()   masks tidyr::extract()
## ✖ plotly::filter()     masks dplyr::filter(), stats::filter()
## ✖ pastecs::first()     masks dplyr::first()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ pastecs::last()      masks dplyr::last()
## ✖ purrr::lift()        masks caret::lift()
## ✖ plotly::select()     masks MASS::select(), dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:gridExtra':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071)
library(rsample)
## 
## Attaching package: 'rsample'
## 
## The following object is masked from 'package:e1071':
## 
##     permutations
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:plotly':
## 
##     slice
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(ROSE)
## Loaded ROSE 0.0-4
library(class)
# copy the fraud_df dataframe to be used in modelling
model_df <- fraud_df
# Detecting whether the data is balanced or not
class_proportion <- prop.table(table(model_df$is_fraud))
class_proportion
## 
##           0           1 
## 0.995755167 0.004244833

3.1 Mixed sampling for the credit card fraud dataset

With ROSE library package, it applies: - Oversampling for fraud category - Undersampling for non-fraud category

library(ROSE)
# mixed sampling
set.seed(123) # for reproducibility of results
balanced_data <- ovun.sample(is_fraud ~ ., data = model_df, method = "both", N = 10000, p = 0.5, seed = 1)$data

# view the balanced dataset
table(balanced_data$is_fraud)
## 
##    0    1 
## 5047 4953

3.1.1 Chart Comparison of the dataset distribution before and after undersampling

# generate a map of the data distribution before processing
plot_original <- ggplot(model_df,aes(x = is_fraud)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Original Data Distribution", x = "Is Fraud", y = "Count")

# generate a map of the processed data distribution
plot_balanced <- ggplot(balanced_data, aes(x = is_fraud)) +
  geom_bar(fill = "lightgreen") +
  labs(title = "Balanced Data Distribution", x = "Is Fraud", y = "Count")

# display the two images side by side
grid.arrange(plot_original, plot_balanced, ncol = 2)

# Read and preprocess data
balanced_model_df <- balanced_data %>%
  mutate(
    category = as.factor(category),
    gender = as.factor(gender),
    city = as.factor(city),
    state = as.factor(state),
    job = as.factor(job),
    amt = as.integer(amt),
    is_fraud = factor(is_fraud, levels = c(0, 1), labels = c("Not_Fraud", "Fraud"))
  ) %>%
  select(-trans_date_trans_time)

# Print the frequency of the is_fraud column
print(table(balanced_model_df$is_fraud))
## 
## Not_Fraud     Fraud 
##      5047      4953

3.2 Train/Test Data Splitting

# Set random seed
set.seed(123)

# Add row names as ID column to the dataframe
df_new <- rownames_to_column(balanced_model_df, var = "id") %>% mutate(id = as.integer(id))

# Stratified sampling to create training and testing sets
set.seed(123)
split <- initial_split(df_new, prop = 0.70, strata = "is_fraud")
training <- training(split) %>% as.data.frame()
testing <- testing(split) %>% as.data.frame()

# View the proportion of is_fraud column in training and testing sets
print(paste0("There are ", length(training$is_fraud), " rows of data in training set."))
## [1] "There are 6999 rows of data in training set."
print(paste0("There are ", length(testing$is_fraud), " rows of data in testing set."))
## [1] "There are 3001 rows of data in testing set."
# check if the training dataset is clean from missing values
summary_table <- training %>% reframe(
      name=as.vector(names(training)),
      count=n(), # total count
      unique=sapply(training, function (x){length(unique(x))}), # unique count
      missing_value_count=sapply(training, function(x){sum(is.na(x))}) # check missing values
)
summary_table
##          name count unique missing_value_count
## 1          id  6999   6999                   0
## 2           X  6999   5420                   0
## 3      cc_num  6999    875                   0
## 4    merchant  6999    690                   0
## 5    category  6999     14                   0
## 6         amt  6999    838                   0
## 7       first  6999    335                   0
## 8        last  6999    452                   0
## 9      gender  6999      2                   0
## 10     street  6999    875                   0
## 11       city  6999    812                   0
## 12      state  6999     50                   0
## 13        zip  6999    865                   0
## 14        lat  6999    863                   0
## 15       long  6999    863                   0
## 16   city_pop  6999    800                   0
## 17        job  6999    464                   0
## 18  trans_num  6999   5425                   0
## 19  unix_time  6999    216                   0
## 20  merch_lat  6999   5424                   0
## 21 merch_long  6999   5425                   0
## 22   is_fraud  6999      2                   0
## 23        age  6999    861                   0

Part 4: Modelling & Performance Evaluation

4.1 Decision Tree

# Create trainControl object
set.seed(123)
train_control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)

# Decision tree model fitting
tune_grid <- expand.grid(cp = seq(0.01, 0.1, by = 0.01))
model_dt <- train(
  is_fraud ~ amt + gender + city_pop + category +
    state + job,
  data = training,
  method = "rpart",
  trControl = train_control,
  tuneGrid = tune_grid,
  metric = "ROC"
)

# Print the best parameters
print(model_dt$bestTune)
##     cp
## 1 0.01
# Model prediction
pred_dt_class <- predict(model_dt, newdata = testing)
cm_dt<-confusionMatrix(pred_dt_class, testing$is_fraud,positive="Fraud")
cm_dt
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not_Fraud Fraud
##   Not_Fraud      1404   150
##   Fraud           111  1336
##                                           
##                Accuracy : 0.913           
##                  95% CI : (0.9024, 0.9229)
##     No Information Rate : 0.5048          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.826           
##                                           
##  Mcnemar's Test P-Value : 0.01867         
##                                           
##             Sensitivity : 0.8991          
##             Specificity : 0.9267          
##          Pos Pred Value : 0.9233          
##          Neg Pred Value : 0.9035          
##              Prevalence : 0.4952          
##          Detection Rate : 0.4452          
##    Detection Prevalence : 0.4822          
##       Balanced Accuracy : 0.9129          
##                                           
##        'Positive' Class : Fraud           
## 

4.2 Random Forest

# Random forest model fitting
set.seed(123)
model_rf <- randomForest(
  is_fraud ~ amt + city_pop,
  data = training,
  ntree = 500,
  mtry = 2
)

# Model prediction
pred_rf_class <- predict(model_rf, newdata = testing)
cm_rf<-confusionMatrix(pred_rf_class, testing$is_fraud,positive="Fraud")
cm_rf
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not_Fraud Fraud
##   Not_Fraud      1381    61
##   Fraud           134  1425
##                                           
##                Accuracy : 0.935           
##                  95% CI : (0.9256, 0.9436)
##     No Information Rate : 0.5048          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8701          
##                                           
##  Mcnemar's Test P-Value : 2.522e-07       
##                                           
##             Sensitivity : 0.9590          
##             Specificity : 0.9116          
##          Pos Pred Value : 0.9140          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.4952          
##          Detection Rate : 0.4748          
##    Detection Prevalence : 0.5195          
##       Balanced Accuracy : 0.9353          
##                                           
##        'Positive' Class : Fraud           
## 

4.3 Logistics Regression

# Install and load glmnet
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
# Prepare data for glmnet
x_train <- model.matrix(is_fraud ~ amt + gender + city_pop + category + state, training)[, -1]
y_train <- as.factor(training$is_fraud)

# Fit logistic regression with L1 regularization (lasso)
set.seed(123)
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)

# Prepare the test data
x_test <- model.matrix(is_fraud ~ amt + gender + city_pop + category + state, testing)[, -1]

# Predict on the test set
pred_lr_proba <- predict(cv_fit, newx = x_test, type = "response", s = "lambda.min")
pred_lr_class <- ifelse(pred_lr_proba > 0.5, "Fraud", "Not_Fraud")
pred_lr_class <- factor(pred_lr_class, levels = levels(testing$is_fraud))

cm_lr<-confusionMatrix(pred_lr_class, testing$is_fraud, positive="Fraud")
cm_lr
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not_Fraud Fraud
##   Not_Fraud      1327   365
##   Fraud           188  1121
##                                           
##                Accuracy : 0.8157          
##                  95% CI : (0.8014, 0.8295)
##     No Information Rate : 0.5048          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.631           
##                                           
##  Mcnemar's Test P-Value : 7.194e-14       
##                                           
##             Sensitivity : 0.7544          
##             Specificity : 0.8759          
##          Pos Pred Value : 0.8564          
##          Neg Pred Value : 0.7843          
##              Prevalence : 0.4952          
##          Detection Rate : 0.3735          
##    Detection Prevalence : 0.4362          
##       Balanced Accuracy : 0.8151          
##                                           
##        'Positive' Class : Fraud           
## 

4.4 K-Nearest Neighbour (KNN)

# KNN model fitting
set.seed(123)
knn_model <- train(
  is_fraud ~ amt + gender + city_pop + category + state + job,
  data = training,
  method = "knn",
  trControl = train_control,
  metric = "ROC"
)

pred_knn_class <- predict(knn_model, newdata = testing, type="raw")
cm_knn<-confusionMatrix(pred_knn_class, testing$is_fraud,)
cm_knn
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Not_Fraud Fraud
##   Not_Fraud      1291   128
##   Fraud           224  1358
##                                          
##                Accuracy : 0.8827         
##                  95% CI : (0.8707, 0.894)
##     No Information Rate : 0.5048         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7655         
##                                          
##  Mcnemar's Test P-Value : 4.116e-07      
##                                          
##             Sensitivity : 0.8521         
##             Specificity : 0.9139         
##          Pos Pred Value : 0.9098         
##          Neg Pred Value : 0.8584         
##              Prevalence : 0.5048         
##          Detection Rate : 0.4302         
##    Detection Prevalence : 0.4728         
##       Balanced Accuracy : 0.8830         
##                                          
##        'Positive' Class : Not_Fraud      
## 

4.5 Comparing ML models

  1. ROC curves
library(pROC)

# Convert factor to numeric
pred_dt_class <- as.numeric(pred_dt_class)
pred_rf_class <- as.numeric(pred_rf_class)
pred_lr_class <- as.numeric(pred_lr_class)
pred_knn_class <- as.numeric(pred_knn_class)

# calculate AUC-ROC
roc_dt <- roc(testing$is_fraud, pred_dt_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
roc_rf <- roc(testing$is_fraud, pred_rf_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
roc_lr <-roc(testing$is_fraud, pred_lr_proba, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Warning in roc.default(testing$is_fraud, pred_lr_proba, positive = "Fraud"):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
roc_knn <-roc(testing$is_fraud, pred_knn_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
# Plotting both ROC curves on the same plot
auc_labels <- c(paste0("DT (AUC=", round(roc_dt$auc,4), ")"),
      paste0("RF (AUC=", round(roc_rf$auc, 4), ")"),
      paste0("LR (AUC=", round(roc_lr$auc,4), ")"),
      paste0("KNN (AUC=", round(roc_knn$auc, 4), ")"))

ggroc(list(DT= roc_dt, RF = roc_rf, KNN = roc_knn,LR = roc_lr), legacy.axes=TRUE) +
  ggtitle("Average ROC Curves") + scale_color_discrete(labels=auc_labels)

  1. Accuracy, F1-score, Recall and Sensitivity
# combine all the metrics of different ML models into a table
result_df<-data.frame(
  Model_name=c('DECISION TREE','RANDOMFOREST','LOGISTIC REGRESSION','KNN'),
  Accuracy=c(cm_dt$overall['Accuracy'],cm_rf$overall['Accuracy'],
             cm_lr$overall['Accuracy'],cm_knn$overall['Accuracy']
             ),
  F1_score=c(cm_dt$byClass['F1'],cm_rf$byClass['F1'],
             cm_lr$byClass['F1'],cm_knn$byClass['F1']
  ),
  Recall = c(cm_dt$byClass['Recall'],cm_rf$byClass['Recall'],
             cm_lr$byClass['Recall'],cm_knn$byClass['Recall']
  ),
  Sensitivity=c(cm_dt$byClass['Sensitivity'],cm_rf$byClass['Sensitivity'],
             cm_lr$byClass['Sensitivity'],cm_knn$byClass['Sensitivity']
             )
)
# View(result_df)
print(result_df)
##            Model_name  Accuracy  F1_score    Recall Sensitivity
## 1       DECISION TREE 0.9130290 0.9110126 0.8990579   0.8990579
## 2        RANDOMFOREST 0.9350217 0.9359606 0.9589502   0.9589502
## 3 LOGISTIC REGRESSION 0.8157281 0.8021467 0.7543742   0.7543742
## 4                 KNN 0.8827058 0.8800273 0.8521452   0.8521452
#overall rating
ggplot(result_df,aes(x=Model_name,y=Accuracy, fill=Accuracy)) + geom_bar(stat="identity")

Conclusion

  1. Which machine learning models are better to classify the credit card fraud cases?
  1. Are older customers more likely to participate in credit card fraud compared to younger customers?