Group_5_Data_Wizards

Credit Card Fraud Analysis

Dataset Used: https://www.kaggle.com/datasets/kartik2112/fraud-detection

Project background

According to datasets published by the central bank of Malaysia, credit card adoption among users is growing. Even in the age of digital payments, credit card usage in Malaysia remains substantial, accounting for over 177 billion ringgit in 2022 alone.

The increase in adoption has brought not only opportunities but also a threat: fraudsters. Because the market is so large, it is an attractive target for credit card fraud. The data on fraud is clear: in the first 7 months of 2022 alone, Malaysia lost 415 million ringgit to fraud, and billions of dollars are lost globally every year.

As a result, my team and I intend to contribute to fraud identification and detection through data science methods by building a prediction model on credit card transaction data. The training data comes from the Kaggle dataset linked above, which was generated with the Sparkov data generator and simulates legitimate and fraudulent credit card transactions in the US.

Project impact towards society

This project explores whether a reliable model can detect fraud before it is too late to intervene. Such a model could benefit authorities, card scheme providers such as Visa, MasterCard, and EuroNet, and banks by making the credit card system safer, more reliable, and more trusted.

Furthermore, it helps credit card users avoid situations where their card is used fraudulently, which can lead to deteriorating credit scores and debts they cannot afford. A more secure, fraud-protected credit card environment also increases users’ confidence in the system, strengthening the payment services landscape. Finally, such a model could reduce the number of fraud incidents, which can indirectly help tackle the proliferation of illicit funds and terrorist financing, a major concern in the financial sector.

Problem Statement

Credit card fraud happens when someone makes unauthorized transactions using another person’s credit card. Credit card usage has increased over the years, driven by online transactions and cashless payments. Credit card fraud causes financial loss, identity theft, and inconvenience for cardholders.

Project Objectives

To identify and understand patterns associated with fraud by analyzing credit card transactions, distinguishing the characteristics and indicators of legitimate transactions from those of potentially fraudulent ones.

To evaluate the performance and reliability of the fraud prediction model by assessing its robustness with appropriate metrics and cross-validation, so that the results remain applicable to real-world data.

To develop an adaptive fraud detection model: a machine learning system that performs well on historical data and can adapt to evolving credit card fraud patterns and new tactics for a proactive defense.

Key Questions

These questions focus on understanding and analyzing the broader context and trends within the dataset, which can be crucial for developing effective strategies to prevent and manage credit card fraud.

Fraud Pattern Identification: What are the common characteristics of fraudulent transactions? This includes understanding the time, location, amount, and frequency of such transactions.
Transaction Analysis: How do the properties of fraudulent transactions differ from those of legitimate ones?
Geographic Analysis: Are there specific regions or locations where fraud is more prevalent?
Temporal Patterns: At what times do fraud incidents peak? Are there patterns related to days of the week, months, or specific events/holidays?
Impact Assessment: What is the financial impact of fraud on both the customers and the institution?

1. Data Pre-Processing

Load the dataset and required packages

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages('tidyverse')

## 
## The downloaded binary packages are in
##  /var/folders/xk/11pvb5y144z3x4656f9bttr00000gn/T//RtmpqyRhbm/downloaded_packages

suppressPackageStartupMessages(library(tidyverse))
credit_data <- read.csv('credit_card_fraud.csv')
install.packages("dplyr")

library(dplyr)
library(ggplot2)
install.packages("ggthemes")

library(ggthemes)
library(lubridate)
install.packages("ggmap")

library(ggmap)

## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.

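# Read the Google Maps API key from an environment variable instead of hard-coding it
# (the variable name GOOGLE_MAPS_API_KEY is an assumption; set it in .Renviron)
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))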
library(grid)
library(broom)
install.packages(c("glmnet", "rpart", "rpart.plot", "ranger", "caret", "caretEnsemble"))

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8

library(rpart)
library(rpart.plot)
library(ranger)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(caretEnsemble)

## 
## Attaching package: 'caretEnsemble'
## 
## The following object is masked from 'package:ggplot2':
## 
##     autoplot

install.packages("mlbench")

library(mlbench)
library(class)

Data pre-processing is one of the most important steps in getting raw data ready for analysis or machine learning models. We began by installing the required packages and loading the libraries in R.
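
The chunk above reinstalls every package each time the report is knitted. As a minimal sketch, the installation step could be limited to packages that are not yet present. The helper name `missing_packages` is hypothetical and not part of the original script, and the install call is left commented out so the sketch has no side effects.

```r
# Add-on packages taken from the library() calls above
pkgs <- c("tidyverse", "ggthemes", "lubridate", "ggmap", "glmnet", "rpart",
          "rpart.plot", "ranger", "caret", "caretEnsemble", "mlbench")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(pkgs)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "https://cloud.r-project.org")
# }

# Packages that ship with R are always reported as present
missing_packages(c("stats", "utils"))
```

With this pattern, repeated runs skip the download step once everything is installed.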

We then imported the dataset, checked that each variable had the correct type, and examined summary statistics. After confirming that there were no missing values, we explored the distribution of numeric variables using histograms and log transformations.

str(credit_data) # Check variable types; the date columns are still character and are converted during data cleaning

## 'data.frame':    339607 obs. of  15 variables:
##  $ trans_date_trans_time: chr  "2019-01-01 00:00:44" "2019-01-01 00:00:51" "2019-01-01 00:07:27" "2019-01-01 00:09:03" ...
##  $ merchant             : chr  "Heller, Gutmann and Zieme" "Lind-Buckridge" "Kiehn Inc" "Beier-Hyatt" ...
##  $ category             : chr  "grocery_pos" "entertainment" "grocery_pos" "shopping_pos" ...
##  $ amt                  : num  107.23 220.11 96.29 7.77 6.85 ...
##  $ city                 : chr  "Orient" "Malad City" "Grenada" "High Rolls Mountain Park" ...
##  $ state                : chr  "WA" "ID" "CA" "NM" ...
##  $ lat                  : num  48.9 42.2 41.6 32.9 43 ...
##  $ long                 : num  -118 -112 -123 -106 -111 ...
##  $ city_pop             : int  149 4154 589 899 471 4878 4005 597 46 85 ...
##  $ job                  : chr  "Special educational needs teacher" "Nature conservation officer" "Systems analyst" "Naval architect" ...
##  $ dob                  : chr  "1978-06-21" "1962-01-19" "1945-12-21" "1967-08-30" ...
##  $ trans_num            : chr  "1f76529f8574734946361c461b024d99" "a1a22d70485983eac12b5b88dad1cf95" "413636e759663f264aae1819a4d4f231" "8a6293af5ed278dea14448ded2685fea" ...
##  $ merch_lat            : num  49.2 43.2 41.7 32.9 43.8 ...
##  $ merch_long           : num  -118 -112 -122 -107 -111 ...
##  $ is_fraud             : int  0 0 0 0 0 0 0 0 0 0 ...

summary(credit_data)

##  trans_date_trans_time   merchant           category              amt          
##  Length:339607         Length:339607      Length:339607      Min.   :    1.00  
##  Class :character      Class :character   Class :character   1st Qu.:    9.60  
##  Mode  :character      Mode  :character   Mode  :character   Median :   46.46  
##                                                              Mean   :   70.58  
##                                                              3rd Qu.:   83.35  
##                                                              Max.   :28948.90  
##      city              state                lat             long        
##  Length:339607      Length:339607      Min.   :20.03   Min.   :-165.67  
##  Class :character   Class :character   1st Qu.:36.72   1st Qu.:-120.09  
##  Mode  :character   Mode  :character   Median :39.62   Median :-111.10  
##                                        Mean   :39.72   Mean   :-110.62  
##                                        3rd Qu.:41.71   3rd Qu.:-100.62  
##                                        Max.   :66.69   Max.   : -89.63  
##     city_pop           job                dob             trans_num        
##  Min.   :     46   Length:339607      Length:339607      Length:339607     
##  1st Qu.:    471   Class :character   Class :character   Class :character  
##  Median :   1645   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 107141                                                           
##  3rd Qu.:  35439                                                           
##  Max.   :2383912                                                           
##    merch_lat       merch_long         is_fraud       
##  Min.   :19.03   Min.   :-166.67   Min.   :0.000000  
##  1st Qu.:36.82   1st Qu.:-119.82   1st Qu.:0.000000  
##  Median :39.59   Median :-111.04   Median :0.000000  
##  Mean   :39.72   Mean   :-110.62   Mean   :0.005247  
##  3rd Qu.:42.19   3rd Qu.:-100.35   3rd Qu.:0.000000  
##  Max.   :67.51   Max.   : -88.63   Max.   :1.000000

#No missing data

# amt and city_pop show possible outliers, which is to be expected. We look at the distribution of those variables anyway.
hist(credit_data$amt) # Heavily right-skewed, so let's see the log transformation

hist(log(credit_data$amt)) # Acceptable
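
The summary above shows a mean `amt` (70.58) well above the median (46.46) and a maximum of 28948.90, which points to a long right tail. Since the minimum is 1.00, the logarithm is defined for every row. The sketch below illustrates, on simulated log-normal data rather than the real `amt` column, how a log transformation reduces skewness. The simulation parameters and the `skewness` helper are assumptions chosen for illustration only.

```r
set.seed(42)  # reproducible synthetic data; NOT the real `amt` column

# Simulated right-skewed amounts (log-normal), a stand-in for `amt`
amt_sim <- rlnorm(10000, meanlog = 3.5, sdlog = 1.2)

# Sample skewness: mean of the cubed standardized values
# (close to 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(amt_sim)
skew_log <- skewness(log(amt_sim))

cat("skewness raw:", round(skew_raw, 2), "\n")
cat("skewness log:", round(skew_log, 2), "\n")
```

Running the same helper on `credit_data$amt` and `log(credit_data$amt)` would give a numeric counterpart to the visual check in the histograms.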

The ‘is_fraud’ variable was converted to a factor with custom levels (“No” and “Yes”). Finally, we analyzed the categorical variables, focusing on the distribution of the ‘category’ variable and its relationship to fraud through frequency tables.

# Convert is_fraud into a factor with the labels "No" (0) and "Yes" (1)
credit_data$is_fraud <- factor(credit_data$is_fraud, levels = c(0, 1), labels = c("No", "Yes"))

table(credit_data$category) # The suffixes _net and _pos presumably mean online and point-of-sale transactions; we move on

## 
##  entertainment    food_dining  gas_transport    grocery_net    grocery_pos 
##          24222          23038          35089          11355          32732 
## health_fitness           home      kids_pets       misc_net       misc_pos 
##          22593          32516          29704          16898          20024 
##  personal_care   shopping_net   shopping_pos         travel 
##          24406          26379          30329          10322

table(credit_data$category, credit_data$is_fraud) # grocery_pos has the highest number of frauds, but we need to look at the fraud rate per transaction.

##                 
##                     No   Yes
##   entertainment  24167    55
##   food_dining    23000    38
##   gas_transport  34936   153
##   grocery_net    11328    27
##   grocery_pos    32299   433
##   health_fitness 22557    36
##   home           32466    50
##   kids_pets      29649    55
##   misc_net       16681   217
##   misc_pos       19962    62
##   personal_care  24351    55
##   shopping_net   25998   381
##   shopping_pos   30142   187
##   travel         10289    33

In general, data pre-processing includes operations like handling missing values, converting types, and changing variables to produce a standardized and clean dataset.

2. Exploratory Data Analysis (EDA)

With EDA, our task is to understand the dataset.

First, we check for missing values to confirm that no data is absent.

# check for missing values in the dataset
sum(is.na(credit_data))

## [1] 0
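
A single total of zero confirms there are no `NA` values, but it cannot reveal empty strings in character columns. The sketch below counts both per column on a small toy data frame. The toy values are hypothetical and only illustrate the check; they are not rows from the dataset.

```r
# Toy data frame (hypothetical values) with one NA and one empty string
toy <- data.frame(
  merchant = c("Kiehn Inc", "", "Beier-Hyatt"),
  amt      = c(96.29, 7.77, NA),
  stringsAsFactors = FALSE
)

# Missing values per column
na_per_col <- colSums(is.na(toy))

# Empty or whitespace-only strings per character column (0 for other types)
blank_per_col <- sapply(toy, function(col) {
  if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
})

print(na_per_col)
print(blank_per_col)
```

Applying the same two lines to `credit_data` would show which column any missing or blank entry belongs to.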

Next, we list the column names to see which details of each financial transaction are recorded.

# display dataset columns 
names(credit_data)

##  [1] "trans_date_trans_time" "merchant"              "category"             
##  [4] "amt"                   "city"                  "state"                
##  [7] "lat"                   "long"                  "city_pop"             
## [10] "job"                   "dob"                   "trans_num"            
## [13] "merch_lat"             "merch_long"            "is_fraud"

Using the head and tail functions, we then inspected the first and last few rows of the dataset.

# explore the first and last few rows of dataset
head(credit_data)

##   trans_date_trans_time                  merchant      category    amt
## 1   2019-01-01 00:00:44 Heller, Gutmann and Zieme   grocery_pos 107.23
## 2   2019-01-01 00:00:51            Lind-Buckridge entertainment 220.11
## 3   2019-01-01 00:07:27                 Kiehn Inc   grocery_pos  96.29
## 4   2019-01-01 00:09:03               Beier-Hyatt  shopping_pos   7.77
## 5   2019-01-01 00:21:32                Bruen-Yost      misc_pos   6.85
## 6   2019-01-01 00:22:06                 Kunze Inc   grocery_pos  90.22
##                       city state     lat      long city_pop
## 1                   Orient    WA 48.8878 -118.2105      149
## 2               Malad City    ID 42.1808 -112.2620     4154
## 3                  Grenada    CA 41.6125 -122.5258      589
## 4 High Rolls Mountain Park    NM 32.9396 -105.8189      899
## 5                  Freedom    WY 43.0172 -111.0292      471
## 6                  Honokaa    HI 20.0827 -155.4880     4878
##                                 job        dob                        trans_num
## 1 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99
## 2       Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95
## 3                   Systems analyst 1945-12-21 413636e759663f264aae1819a4d4f231
## 4                   Naval architect 1967-08-30 8a6293af5ed278dea14448ded2685fea
## 5         Education officer, museum 1967-08-02 f3c43d336e92a44fc2fb67058d5949e3
## 6                   Physiotherapist 1966-12-03 95826e3caa9e0b905294c6dae985aec1
##   merch_lat merch_long is_fraud
## 1  49.15905  -118.1865       No
## 2  43.15070  -112.1545       No
## 3  41.65752  -122.2303       No
## 4  32.86326  -106.5202       No
## 5  43.75373  -111.4549       No
## 6  19.56001  -156.0459       No

tail(credit_data)

##        trans_date_trans_time                       merchant       category
## 339602   2020-12-31 23:57:18 Larkin, Stracke and Greenfelde  entertainment
## 339603   2020-12-31 23:57:56                 Schmidt-Larkin           home
## 339604   2020-12-31 23:58:04      Pouros, Walker and Spence      kids_pets
## 339605   2020-12-31 23:59:07                Reilly and Sons health_fitness
## 339606   2020-12-31 23:59:15                      Rau-Robel      kids_pets
## 339607   2020-12-31 23:59:24                Breitenberg LLC         travel
##          amt               city state     lat      long city_pop
## 339602 46.71 Blairsden-Graeagle    CA 39.8127 -120.6405     1725
## 339603 12.68              Wales    AK 64.7556 -165.6723      145
## 339604 13.02          Greenview    CA 41.5403 -122.9366      308
## 339605 43.77              Luray    MO 40.4931  -91.8912      519
## 339606 86.88            Burbank    WA 46.1966 -118.9017     3684
## 339607  7.99               Mesa    ID 44.6255 -116.4493      129
##                                                  job        dob
## 339602 Chartered legal executive (England and Wales) 1967-05-27
## 339603                      Administrator, education 1939-11-09
## 339604                           Call centre manager 1958-09-20
## 339605                                  Town planner 1966-02-13
## 339606                                      Musician 1981-11-29
## 339607                                  Cartographer 1965-12-15
##                               trans_num merch_lat merch_long is_fraud
## 339602 a7105564935ea3977dc61ff9ced3bf5e  38.96354 -120.45712       No
## 339603 a8310343c189e4a5b6316050d2d6b014  65.62359 -165.18603       No
## 339604 bd7071fd5c9510a5594ee196368ac80e  41.97313 -123.55303       No
## 339605 9b1f753c79894c9f4b71f04581835ada  39.94684  -91.33333       No
## 339606 6c5b7c8add471975aa0fec023b2e8408  46.65834 -119.71505       No
## 339607 14392d723bb7737606b2700ac791b7aa  44.47053 -117.08089       No

Next, to find out which spending categories the transactions fall into, we looked at the unique values of the category column.

# explore unique values of category column
unique_category <- unique(credit_data$category)
print(unique_category)

##  [1] "grocery_pos"    "entertainment"  "shopping_pos"   "misc_pos"      
##  [5] "shopping_net"   "gas_transport"  "misc_net"       "grocery_net"   
##  [9] "food_dining"    "health_fitness" "kids_pets"      "home"          
## [13] "personal_care"  "travel"

The summary function provides numerical details such as the minimum, median, mean, and maximum values.

# statistical summary
summary(credit_data)

##  trans_date_trans_time   merchant           category              amt          
##  Length:339607         Length:339607      Length:339607      Min.   :    1.00  
##  Class :character      Class :character   Class :character   1st Qu.:    9.60  
##  Mode  :character      Mode  :character   Mode  :character   Median :   46.46  
##                                                              Mean   :   70.58  
##                                                              3rd Qu.:   83.35  
##                                                              Max.   :28948.90  
##      city              state                lat             long        
##  Length:339607      Length:339607      Min.   :20.03   Min.   :-165.67  
##  Class :character   Class :character   1st Qu.:36.72   1st Qu.:-120.09  
##  Mode  :character   Mode  :character   Median :39.62   Median :-111.10  
##                                        Mean   :39.72   Mean   :-110.62  
##                                        3rd Qu.:41.71   3rd Qu.:-100.62  
##                                        Max.   :66.69   Max.   : -89.63  
##     city_pop           job                dob             trans_num        
##  Min.   :     46   Length:339607      Length:339607      Length:339607     
##  1st Qu.:    471   Class :character   Class :character   Class :character  
##  Median :   1645   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 107141                                                           
##  3rd Qu.:  35439                                                           
##  Max.   :2383912                                                           
##    merch_lat       merch_long      is_fraud    
##  Min.   :19.03   Min.   :-166.67   No :337825  
##  1st Qu.:36.82   1st Qu.:-119.82   Yes:  1782  
##  Median :39.59   Median :-111.04               
##  Mean   :39.72   Mean   :-110.62               
##  3rd Qu.:42.19   3rd Qu.:-100.35               
##  Max.   :67.51   Max.   : -88.63

Next, we turn to the fraudulent transactions to get an overall impression.

# table view of fraud summary (No: legitimate transaction, Yes: fraud transaction)
fraud_summary_table<-table(credit_data$is_fraud)
print(fraud_summary_table)

## 
##     No    Yes 
## 337825   1782
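
The two counts above show a strong class imbalance. As a small sketch using only those counts, the share of each class and the accuracy of a trivial model that always predicts "No" can be computed as follows. The variable names are illustrative.

```r
# Class counts copied from the table above
counts <- c(No = 337825, Yes = 1782)

# Share of each class
shares <- prop.table(counts)

# Accuracy of a trivial model that always predicts "No"
baseline_accuracy <- shares[["No"]]

print(round(100 * shares, 3))  # shares in percent
print(baseline_accuracy)
```

Because roughly 0.52% of transactions are fraudulent, a model can reach about 99.5% accuracy without detecting a single fraud. Metrics such as precision and recall on the "Yes" class are therefore more informative than accuracy when evaluating the models.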

To look for further fraud patterns, we summarize the transactions by state and by job.

# table view of fraud summary by state
fraud_state_table <-table(credit_data$state,credit_data$is_fraud)
print(fraud_state_table)

##     
##         No   Yes
##   AK  2913    50
##   AZ 15298    64
##   CA 80093   402
##   CO 19651   115
##   HI  3633    16
##   ID  8002    33
##   MO 54642   262
##   NE 34209   216
##   NM 23306   121
##   OR 26211   197
##   UT 15296    61
##   WA 26914   126
##   WY 27657   119
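
Raw counts mostly reflect transaction volume: CA has the most fraud cases simply because it has the most transactions. The sketch below converts the counts from the state table into a fraud rate per state. The data frame is typed in from the output above, and the column names are illustrative.

```r
# Counts copied from the state table above
state_counts <- data.frame(
  state = c("AK", "AZ", "CA", "CO", "HI", "ID", "MO",
            "NE", "NM", "OR", "UT", "WA", "WY"),
  no    = c(2913, 15298, 80093, 19651, 3633, 8002, 54642,
            34209, 23306, 26211, 15296, 26914, 27657),
  yes   = c(50, 64, 402, 115, 16, 33, 262,
            216, 121, 197, 61, 126, 119)
)

# Fraud rate in percent = fraud / all transactions in that state
state_counts$perc_fraud <- with(state_counts, 100 * yes / (no + yes))

# Highest rates first
state_rates <- state_counts[order(-state_counts$perc_fraud), ]
print(state_rates, row.names = FALSE)
```

On these counts AK has the highest rate (about 1.7%) despite having the fewest transactions, which the raw table does not make obvious.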

# table view of fraud summary by job
fraud_job_table <-table(credit_data$job,credit_data$is_fraud)
print(fraud_job_table)

##                                                    
##                                                       No  Yes
##   Accountant, chartered                                0   11
##   Administrator, education                          2188   15
##   Administrator, local government                   1457    9
##   Advertising account planner                       3639    9
##   Aeronautical engineer                             2187   14
##   Agricultural consultant                           2190    5
##   Airline pilot                                     2910    6
##   Architect                                         2910   15
##   Architectural technologist                        1458    8
##   Armed forces training and education officer       2914    5
##   Associate Professor                                729    5
##   Barista                                           2184    9
##   Barrister                                         2909    8
##   Building surveyor                                 1457   12
##   Buyer, industrial                                 2181   11
##   Call centre manager                               2187    7
##   Careers information officer                          0   12
##   Cartographer                                      2910   11
##   Charity officer                                    728   12
##   Chartered legal executive (England and Wales)     1455    9
##   Chartered public finance accountant               2190    0
##   Chemical engineer                                  729   11
##   Chief Marketing Officer                           1457    6
##   Chiropodist                                        730   13
##   Civil engineer, contracting                       2910   19
##   Civil Service administrator                        727   12
##   Civil Service fast streamer                        728   11
##   Clinical cytogeneticist                              0    7
##   Clinical research associate                        728    3
##   Clothing/textile technologist                     3631   15
##   Colour technologist                               2187   29
##   Commissioning editor                                 0   14
##   Community arts worker                             1454    7
##   Comptroller                                        727   11
##   Contractor                                        4364    2
##   Counselling psychologist                          2916   15
##   Counsellor                                        2915    9
##   Cytogeneticist                                    2187    5
##   Dealer                                            3645   19
##   Designer, exhibition/display                      4374    4
##   Development worker, international aid                0   10
##   Early years teacher                               2915    8
##   Economist                                          726   11
##   Editor, magazine features                         2179    8
##   Education administrator                           1457   10
##   Education officer, museum                         2912   15
##   Educational psychologist                          3643   18
##   Electronics engineer                              4369   11
##   Engineer, agricultural                            2186    9
##   Engineer, automotive                              2911    8
##   Engineer, biomedical                              1460   22
##   Engineer, building services                       1458   12
##   Engineer, civil (consulting)                       725    9
##   Engineer, communications                          2190    0
##   Engineer, electronics                              726    7
##   Engineer, maintenance                             2184   14
##   Engineer, petroleum                               1452   13
##   Engineer, production                              2185   11
##   Engineer, site                                       0   12
##   Exercise physiologist                              730    4
##   Fine artist                                          0   10
##   Firefighter                                       3643    7
##   Forensic psychologist                              729   12
##   Freight forwarder                                 2908   12
##   Further education lecturer                        1454   10
##   Futures trader                                    1454    8
##   Geologist, engineering                            1457    9
##   Geoscientist                                      4371   18
##   Glass blower/designer                              729   13
##   Health physicist                                  4371    3
##   Health service manager                            2187    6
##   Historic buildings inspector/conservation officer 3641   12
##   Hotel manager                                      728    9
##   Human resources officer                           2909   23
##   Immigration officer                               2914   11
##   Industrial/product designer                          0   11
##   Information officer                                  0    8
##   Information systems manager                       2187    9
##   Insurance broker                                  5093   15
##   Intelligence analyst                              3636    5
##   Investment analyst                                3644   10
##   Investment banker, corporate                      1454   11
##   IT consultant                                     1454    8
##   Journalist, newspaper                              728    9
##   Land/geomatics surveyor                           5099   20
##   Landscape architect                                  0    9
##   Learning mentor                                   2183   12
##   Lecturer, higher education                        2915    7
##   Licensed conveyancer                              2188   10
##   Local government officer                           728    4
##   Location manager                                  2184    9
##   Magazine features editor                          2181    8
##   Marketing executive                                726   10
##   Materials engineer                                1459   19
##   Medical technical officer                          726   14
##   Mental health nurse                               2187    8
##   Metallurgist                                      1459   13
##   Museum education officer                          1459   13
##   Museum/gallery exhibitions officer                2185    8
##   Music therapist                                   3629   14
##   Musician                                          3644    7
##   Nature conservation officer                        727   16
##   Naval architect                                   3639   24
##   Network engineer                                  2188   24
##   Nurse, children's                                 2914   16
##   Nurse, mental health                              1455    6
##   Occupational hygienist                            1457    8
##   Occupational psychologist                         2913   10
##   Osteopath                                         3640   11
##   Petroleum engineer                                4376    7
##   Pharmacist, hospital                               729    7
##   Physiotherapist                                   1454    9
##   Pilot, airline                                    2907   17
##   Planning and development surveyor                 2185    7
##   Podiatrist                                        2912   13
##   Private music teacher                              728   13
##   Product designer                                  2913   11
##   Product/process development scientist             1452    7
##   Production manager                                3649    9
##   Public house manager                              2908    8
##   Public librarian                                  2185   11
##   Public relations account executive                3629   19
##   Radio broadcast assistant                          729    9
##   Radiographer, diagnostic                           729    5
##   Research officer, political party                 4372    9
##   Research scientist (maths)                        2174   11
##   Research scientist (medical)                         0    8
##   Research scientist (physical sciences)            2916   23
##   Retail merchandiser                               1456   11
##   Sales executive                                   2184   11
##   Sales professional, IT                            2912   14
##   Science writer                                     730    9
##   Scientist, audiological                           3638   21
##   Scientist, marine                                  727   12
##   Scientist, physiological                          2177   12
##   Scientist, research (maths)                       2179    7
##   Set designer                                         0   19
##   Soil scientist                                     729    8
##   Solicitor, Scotland                                729   11
##   Special educational needs teacher                 4355    7
##   Surveyor, land/geomatics                          5831   24
##   Surveyor, minerals                                6564   25
##   Surveyor, mining                                  2177   14
##   Systems analyst                                   4370   28
##   Systems developer                                    0    9
##   Tax inspector                                     4371    8
##   Teacher, adult education                           727   10
##   Teacher, early years/pre                           728   11
##   Teaching laboratory technician                     727    3
##   TEFL teacher                                         0   10
##   Telecommunications researcher                     2913    9
##   Television floor manager                           724    5
##   Television/film/video producer                    1456   14
##   Therapist, art                                    3642    8
##   Therapist, horticultural                           728   10
##   Therapist, music                                  1458    8
##   Therapist, occupational                           2907   15
##   Tourist information centre manager                2912    8
##   Town planner                                      2184   11
##   Video editor                                      1454   12
##   Water engineer                                    4373    2
##   Web designer                                      1458    7
##   Wellsite geologist                                2182   20
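
Several jobs in the table have no legitimate transactions at all, only a handful of fraudulent ones. The sketch below flags such jobs using a few rows copied from the table. It covers a subset only, and the column names are illustrative.

```r
# A few rows copied from the job table above (subset only)
job_counts <- data.frame(
  job = c("Accountant, chartered", "Careers information officer",
          "Set designer", "Colour technologist", "Contractor"),
  no  = c(0, 0, 0, 2187, 4364),
  yes = c(11, 12, 19, 29, 2)
)

job_counts$total      <- job_counts$no + job_counts$yes
job_counts$perc_fraud <- 100 * job_counts$yes / job_counts$total

# Jobs where every recorded transaction is fraudulent
all_fraud_jobs <- job_counts$job[job_counts$no == 0]
print(all_fraud_jobs)
```

A 100% fraud rate based on about a dozen rows is unlikely to generalize, so a model given `job` as a raw predictor may simply memorize these labels. This is worth keeping in mind during feature selection.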

We conclude the EDA with a statistical summary of transaction amounts by fraud status. EDA helps us understand the structure and underlying characteristics of the data.

# Overall Summary by is_fraud
isfraud_summary <- credit_data %>%
  group_by(is_fraud) %>%
  summarize (
    mean_amt = mean(amt),
    median_amt = median(amt),
    max_amt = round(max(amt),2),
    min_amt = min(amt),
    total_transactions = n())
print(isfraud_summary)

## # A tibble: 2 × 6
##   is_fraud mean_amt median_amt max_amt min_amt total_transactions
##   <fct>       <dbl>      <dbl>   <dbl>   <dbl>              <int>
## 1 No           68.2       46.2  28949.    1                337825
## 2 Yes         518.       356.    1372.    1.78               1782

Percentage Fraud by Category

#Summarize the percentage of fraud per category
credit_data_prod_cat <- credit_data %>%
  group_by(category) %>%
  summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
  mutate(perc_fraud = total_fraud * 100 / total_trans)

credit_data_prod_cat

## # A tibble: 14 × 4
##    category       total_fraud total_trans perc_fraud
##    <chr>                <int>       <int>      <dbl>
##  1 entertainment           55       24222      0.227
##  2 food_dining             38       23038      0.165
##  3 gas_transport          153       35089      0.436
##  4 grocery_net             27       11355      0.238
##  5 grocery_pos            433       32732      1.32 
##  6 health_fitness          36       22593      0.159
##  7 home                    50       32516      0.154
##  8 kids_pets               55       29704      0.185
##  9 misc_net               217       16898      1.28 
## 10 misc_pos                62       20024      0.310
## 11 personal_care           55       24406      0.225
## 12 shopping_net           381       26379      1.44 
## 13 shopping_pos           187       30329      0.617
## 14 travel                  33       10322      0.320

hist(credit_data$city_pop) #Acceptable

3. Data Cleaning

In the data cleaning process, we first made sure that specific columns in ‘credit_data’ were in numeric format to enable numerical operations. Next, we formatted ‘trans_date_trans_time’ as (date-time) and ‘dob’ as Date, ensuring accurate handling of date and time information for future analyses. These steps aimed to standardize the dataset and improve its reliability.

# convert to numeric format
credit_data$amt <- as.numeric(credit_data$amt)
credit_data$lat <- as.numeric(credit_data$lat)
credit_data$long <- as.numeric(credit_data$long)
credit_data$merch_lat <- as.numeric(credit_data$merch_lat)
credit_data$merch_long <- as.numeric(credit_data$merch_long)

# convert trans_date_trans_time & dob variables
credit_data$trans_date_trans_time <- as.POSIXct(credit_data$trans_date_trans_time)
credit_data$dob <- as.Date(credit_data$dob)
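
The conversions above silently turn unparseable values into NA, so it can help to count how many values were lost. The following is a minimal sketch on made-up values; the vectors `raw_amt` and `raw_time` are hypothetical stand-ins for CSV columns, and the explicit `format` and `tz` arguments are an assumption about how the timestamps are stored.

```r
# Hypothetical raw values as they might arrive from a CSV import
raw_amt  <- c("107.23", "220.11", "n/a")
raw_time <- c("2019-01-01 00:00:44", "2019-01-01 00:00:51", "2019-01-01 00:07:27")

# as.numeric() turns unparseable strings into NA (with a warning)
amt <- suppressWarnings(as.numeric(raw_amt))
n_failed <- sum(is.na(amt) & !is.na(raw_amt))

# An explicit format and time zone make the date-time parsing predictable
trans_time <- as.POSIXct(raw_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

n_failed                   # number of values lost during conversion
format(trans_time, "%H")   # hour component, e.g. for hourly summaries
```

A non-zero `n_failed` would be a signal to inspect the raw column before continuing.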

4. Data Visualisations

Plot a column chart showing percentage fraud by category

ggplot(credit_data_prod_cat, aes(reorder(category, perc_fraud), perc_fraud, fill = category)) +
  geom_col() +
  xlab("Product Category") +
  ylab("Percentage of fraudulent transactions") +
  labs(title = "Fraudulent transactions by category") +
  theme(legend.position = "none", plot.title = element_text(size = 18)) +
  coord_flip()

Transactions by Hour

library(tidyr)
# Re-import dataset for non-adjusted time column
credit_data_time <- read.csv("credit_card_fraud.csv")

credit_data_time <- credit_data_time %>% 
  separate(trans_date_trans_time, into = c("t_date","t_time"), sep = " ") %>%
  separate(t_time, into = c("t_hour","t_min", "t_sec"), sep = ":")

ggplot(credit_data_time, aes(x = t_hour)) + geom_bar() + labs(title = "Transactions by hour", x = "Transaction Hour", y = "Number of Transactions")

Fraudulent Transactions by Hour

credit_data_time_filtered <- credit_data_time %>% 
  filter(is_fraud == 1)

ggplot(credit_data_time_filtered, aes(x = t_hour)) + geom_bar() + labs(title = "Fraudulent Transactions by hour", x = "Transaction Hour", y = "Number of Fraudulent Transactions")

From the two graphs above, overall transactions are spread fairly evenly across the AM (0000 - 1100) and PM (1300 - 2300) hours, with little volatility. Fraudulent transactions, however, spike between 2200 and 0300: this window covers 1530 of the 1782 fraudulent observations, or over 85%.
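
The share of fraud falling in the late-night window can be checked directly rather than read off the chart. This is a minimal sketch on toy data; the data frame `toy` is made up, and treating the window as hours 22 through 03 inclusive is an assumption about how the boundary hours are counted.

```r
# Toy data standing in for credit_data_time (values are made up)
toy <- data.frame(
  t_hour   = c("22", "23", "00", "03", "12", "14", "01", "09"),
  is_fraud = c(1, 1, 1, 1, 1, 0, 0, 0)
)

# Assumed window: 22:00-23:59 and 00:00-03:59
night_hours <- c("22", "23", "00", "01", "02", "03")

# Proportion of fraudulent transactions that fall inside the window
fraud_hours <- toy$t_hour[toy$is_fraud == 1]
night_share <- mean(fraud_hours %in% night_hours)
night_share
```

Applied to the real data, the same expression would use the `t_hour` and `is_fraud` columns of `credit_data_time`.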

credit_data_time$dob <- as.Date(credit_data_time$dob)
credit_data_time$t_date <- as.Date(credit_data_time$t_date)
credit_data_time$age <- as.integer(difftime(credit_data_time$t_date, credit_data_time$dob, units = "days") / 365)

credit_data_time <- credit_data_time %>% 
  separate(t_date, into = c("t_year", "t_month", "t_day"), sep = "-", 
          remove = FALSE, convert = TRUE)

ggplot(credit_data_time, aes(x = t_day)) + geom_bar() + labs(title = "Transactions by days of month", x = "Days of the Month", y = "Number of transactions")

credit_data_time_filtered2 <- credit_data_time %>% 
  filter(is_fraud == 1)

ggplot(credit_data_time_filtered2, aes(x = t_day)) + geom_bar() + labs(title = "Fraudulent Transactions by days of month", x = "Days of the Month", y = "Number of Fraudulent Transactions")

The day-of-month graphs show a few peaks in fraudulent transactions at the start, the middle, and the last day of the month. The peaks are not pronounced enough, however, to identify high-risk dates with confidence.

ggplot(credit_data_time, aes(x = age)) + geom_bar() + labs(title = "Transactions by Age", x = "Age", y = "Number of transactions")

credit_data_time_filtered3 <- credit_data_time %>% 
  filter(is_fraud == 1)
ggplot(credit_data_time_filtered3, aes(x = age)) + geom_bar() + labs(title = "Fraudulent Transactions by Age", x = "Age", y = "Number of Fraudulent Transactions")

From the two graphs above, the age distribution of fraudulent transactions does not differ much from that of all transactions, which suggests that the factors that could contribute to our prediction model lie elsewhere.

Box plot of transaction amount by fraud status

We log-transform the amount axis for the plot because monetary variables are typically right-skewed and contain outliers.

ggplot(credit_data, aes(x = amt, y = factor(is_fraud))) +
  geom_boxplot() +
  labs(title = "Transaction amount by fraud status (Log transformed)") +
  xlab("Amount") +
  ylab("Fraud status") +
  scale_y_discrete(labels = c("Not fraud", "Fraud")) +
  scale_x_log10() +
  theme(plot.title = element_text(size = 18)) +
  coord_flip()

wilcox.test(amt ~ factor(is_fraud), data = credit_data) #p-value is less than 0.001.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  amt by factor(is_fraud)
## W = 102098032, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

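#We reject the null hypothesis and conclude that there is evidence that the distribution of transaction amounts differs between fraudulent and non-fraudulent transactions.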
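
The Wilcoxon rank sum test works on ranks, so the log transformation used for the box plot does not need to be applied before testing. The sketch below illustrates this on simulated data; the sample sizes and log-normal parameters are arbitrary assumptions, not estimates from the fraud dataset.

```r
set.seed(1)
# Simulated, right-skewed amounts for two hypothetical groups
sim <- data.frame(
  amt   = c(rlnorm(40, meanlog = 4, sdlog = 1), rlnorm(30, meanlog = 6, sdlog = 0.5)),
  group = factor(rep(c("No", "Yes"), times = c(40, 30)))
)

raw_test <- wilcox.test(amt ~ group, data = sim)
log_test <- wilcox.test(log(amt) ~ group, data = sim)

# log() preserves the ordering of the amounts, so W and the p-value match
c(raw = unname(raw_test$statistic), log = unname(log_test$statistic))
```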

Summarise the percentage of fraud per merchant and retain the top 15

credit_data_merchant <- credit_data %>%
  group_by(merchant) %>%
  summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
  mutate(perc_fraud = total_fraud * 100 / total_trans) %>%
  top_n(15, perc_fraud)

credit_data_merchant #Top 15 merchants by percentage of fraudulent transactions, ranging from 2.06% to 3.45%.

## # A tibble: 15 × 4
##    merchant                             total_fraud total_trans perc_fraud
##    <chr>                                      <int>       <int>      <dbl>
##  1 Gottlieb, Considine and Schultz               12         566       2.12
##  2 Kerluke Inc                                    7         326       2.15
##  3 Kerluke-Abshire                               17         515       3.30
##  4 Kiehn-Emmerich                                19         658       2.89
##  5 Kunze Inc                                     16         645       2.48
##  6 Kutch and Sons                                13         632       2.06
##  7 Lebsack and Sons                               8         358       2.23
##  8 Moore, Dibbert and Koepp                       8         337       2.37
##  9 Rempel Inc                                    11         523       2.10
## 10 Romaguera, Cruickshank and Greenholt          18         521       3.45
## 11 Schultz, Simonis and Little                   13         622       2.09
## 12 Strosin-Cruickshank                           14         671       2.09
## 13 Terry-Huel                                    13         519       2.50
## 14 Tillman, Fritsch and Schmitt                   9         364       2.47
## 15 Welch Inc                                      8         342       2.34

Column chart showing percentage fraud by Merchant (top 15)

#Plot a column chart of percentage fraud by merchant (top 15)
ggplot(credit_data_merchant, aes(reorder(merchant, perc_fraud), perc_fraud, fill = merchant)) +
  geom_col() +
  xlab("Merchant name") +
  ylab("Percentage of fraudulent transactions") +
  labs(title = "Fraudulent transactions by merchant name") +
  theme(legend.position = "none",
        plot.title = element_text(size = 12)) +
  coord_flip()

Summarize the percentage of fraud per city with at least 100 transactions and retain the top 15

credit_data_city <- credit_data %>%
  group_by(city) %>%
  summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
  filter(total_trans >= 100) %>%
  mutate(perc_fraud = total_fraud * 100 / total_trans) %>%
  top_n(15, perc_fraud)

credit_data_city #Top 15 cities with the most fraudulent transactions ranging from 1.49 to 3.07%

## # A tibble: 15 × 4
##    city                      total_fraud total_trans perc_fraud
##    <chr>                           <int>       <int>      <dbl>
##  1 Albuquerque                        24        1479       1.62
##  2 Aurora                             23         750       3.07
##  3 Azusa                              11         739       1.49
##  4 Bay City                           15         744       2.02
##  5 Brashear                           13         741       1.75
##  6 Jordan Valley                      11         737       1.49
##  7 Louisiana                          11         739       1.49
##  8 Loving                             11         737       1.49
##  9 Monitor                            14         740       1.89
## 10 Owensville                         13         741       1.75
## 11 Sprague                            13         743       1.75
## 12 Valentine                          12         741       1.62
## 13 Westfir                            12         741       1.62
## 14 Williamsburg                       13         742       1.75
## 15 Yellowstone National Park          12         742       1.62

Column chart showing percentage fraud by city (top 15)

ggplot(credit_data_city, aes(reorder(city, perc_fraud), perc_fraud, fill = city)) +
  geom_col() +
  xlab("City name") +
  ylab("Percentage of fraudulent transactions") +
  labs(title = "Fraudulent transactions by city name") +
  theme(legend.position = "none") +
  coord_flip()

Extract transaction date and time

install.packages('lubridate')

## 
## The downloaded binary packages are in
##  /var/folders/xk/11pvb5y144z3x4656f9bttr00000gn/T//RtmpqyRhbm/downloaded_packages

library(lubridate)
# Extract the trans_date from trans_date_trans_time
credit_data_age <- credit_data %>%
  mutate(trans_date = as.Date(trans_date_trans_time)) %>%
  # Then calculate age in years
  mutate(age = floor(time_length(interval(dob, trans_date), "years"))) %>%
  # Deselect the unneeded date columns
  select(-trans_date_trans_time, -trans_date)

head(credit_data_age)

##                    merchant      category    amt                     city state
## 1 Heller, Gutmann and Zieme   grocery_pos 107.23                   Orient    WA
## 2            Lind-Buckridge entertainment 220.11               Malad City    ID
## 3                 Kiehn Inc   grocery_pos  96.29                  Grenada    CA
## 4               Beier-Hyatt  shopping_pos   7.77 High Rolls Mountain Park    NM
## 5                Bruen-Yost      misc_pos   6.85                  Freedom    WY
## 6                 Kunze Inc   grocery_pos  90.22                  Honokaa    HI
##       lat      long city_pop                               job        dob
## 1 48.8878 -118.2105      149 Special educational needs teacher 1978-06-21
## 2 42.1808 -112.2620     4154       Nature conservation officer 1962-01-19
## 3 41.6125 -122.5258      589                   Systems analyst 1945-12-21
## 4 32.9396 -105.8189      899                   Naval architect 1967-08-30
## 5 43.0172 -111.0292      471         Education officer, museum 1967-08-02
## 6 20.0827 -155.4880     4878                   Physiotherapist 1966-12-03
##                          trans_num merch_lat merch_long is_fraud age
## 1 1f76529f8574734946361c461b024d99  49.15905  -118.1865       No  40
## 2 a1a22d70485983eac12b5b88dad1cf95  43.15070  -112.1545       No  56
## 3 413636e759663f264aae1819a4d4f231  41.65752  -122.2303       No  73
## 4 8a6293af5ed278dea14448ded2685fea  32.86326  -106.5202       No  51
## 5 f3c43d336e92a44fc2fb67058d5949e3  43.75373  -111.4549       No  51
## 6 95826e3caa9e0b905294c6dae985aec1  19.56001  -156.0459       No  52
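
The age calculation above relies on lubridate. For comparison, completed years can also be computed in base R, which avoids the small drift that comes from dividing a day count by 365. This is a sketch only; `age_in_years` is a hypothetical helper, and the transaction date used in the example call is an assumption.

```r
# Completed years between a date of birth and a reference date
age_in_years <- function(dob, ref_date) {
  dob <- as.Date(dob)
  ref_date <- as.Date(ref_date)
  years <- as.integer(format(ref_date, "%Y")) - as.integer(format(dob, "%Y"))
  # Subtract one year when the birthday has not yet occurred in the reference year
  before_birthday <- format(ref_date, "%m%d") < format(dob, "%m%d")
  years - as.integer(before_birthday)
}

# Date of birth from the first row above; the transaction date is assumed
age_in_years("1978-06-21", "2019-01-01")
```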

Box Plot of ages by fraud status

#Box plot of age by fraud status
ggplot(credit_data_age, aes(x = age, y = factor(is_fraud))) +
  geom_boxplot() +
  labs(title = "Age by fraud status") +
  xlab("Age") +
  ylab("Fraud status") +
  scale_y_discrete(labels = c("Not fraud", "Fraud")) +
  coord_flip()

#breaking age to over 50 and under 50
credit_data_age <- credit_data_age %>%
  mutate(age_cat = ifelse(age >= 50, "Over 50", "Under 50"))
head(credit_data_age)

##                    merchant      category    amt                     city state
## 1 Heller, Gutmann and Zieme   grocery_pos 107.23                   Orient    WA
## 2            Lind-Buckridge entertainment 220.11               Malad City    ID
## 3                 Kiehn Inc   grocery_pos  96.29                  Grenada    CA
## 4               Beier-Hyatt  shopping_pos   7.77 High Rolls Mountain Park    NM
## 5                Bruen-Yost      misc_pos   6.85                  Freedom    WY
## 6                 Kunze Inc   grocery_pos  90.22                  Honokaa    HI
##       lat      long city_pop                               job        dob
## 1 48.8878 -118.2105      149 Special educational needs teacher 1978-06-21
## 2 42.1808 -112.2620     4154       Nature conservation officer 1962-01-19
## 3 41.6125 -122.5258      589                   Systems analyst 1945-12-21
## 4 32.9396 -105.8189      899                   Naval architect 1967-08-30
## 5 43.0172 -111.0292      471         Education officer, museum 1967-08-02
## 6 20.0827 -155.4880     4878                   Physiotherapist 1966-12-03
##                          trans_num merch_lat merch_long is_fraud age  age_cat
## 1 1f76529f8574734946361c461b024d99  49.15905  -118.1865       No  40 Under 50
## 2 a1a22d70485983eac12b5b88dad1cf95  43.15070  -112.1545       No  56  Over 50
## 3 413636e759663f264aae1819a4d4f231  41.65752  -122.2303       No  73  Over 50
## 4 8a6293af5ed278dea14448ded2685fea  32.86326  -106.5202       No  51  Over 50
## 5 f3c43d336e92a44fc2fb67058d5949e3  43.75373  -111.4549       No  51  Over 50
## 6 95826e3caa9e0b905294c6dae985aec1  19.56001  -156.0459       No  52  Over 50

Summarise mean latitude and longitude for each state

#Summarize the mean latitude, mean longitude of each state
#Then summarize the total fraud and total transactions
credit_data_state <- credit_data %>%
  group_by(state) %>%
  summarize(lat = mean(lat), long = mean(long), total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
  #Finally calculate the percentage fraud
  mutate(perc_fraud = total_fraud * 100 / total_trans)

credit_data_state #Highest percentage of fraud in Alaska

## # A tibble: 13 × 6
##    state   lat   long total_fraud total_trans perc_fraud
##    <chr> <dbl>  <dbl>       <int>       <int>      <dbl>
##  1 AK     65.0 -163.           50        2963      1.69 
##  2 AZ     33.7 -112.           64       15362      0.417
##  3 CA     36.7 -120.          402       80495      0.499
##  4 CO     39.5 -105.          115       19766      0.582
##  5 HI     20.0 -155.           16        3649      0.438
##  6 ID     44.3 -116.           33        8035      0.411
##  7 MO     38.7  -92.7         262       54904      0.477
##  8 NE     41.2  -98.3         216       34425      0.627
##  9 NM     35.1 -106.          121       23427      0.516
## 10 OR     44.7 -122.          197       26408      0.746
## 11 UT     39.5 -111.           61       15357      0.397
## 12 WA     47.6 -120.          126       27040      0.466
## 13 WY     42.4 -108.          119       27776      0.428

#Because Alaska and Hawaii are geographically far from the other locations, 
#it will be tricky to visualize  those two states in the same map with the others.
#We can plot those states separately and apply them as inserts into the main map.

5. Data Model Prediction

Prepare dataset for modelling

credit_data_prep <- credit_data_age %>%
  select(category, amt, state, age_cat, is_fraud)
head(credit_data_prep)

##        category    amt state  age_cat is_fraud
## 1   grocery_pos 107.23    WA Under 50       No
## 2 entertainment 220.11    ID  Over 50       No
## 3   grocery_pos  96.29    CA  Over 50       No
## 4  shopping_pos   7.77    NM  Over 50       No
## 5      misc_pos   6.85    WY  Over 50       No
## 6   grocery_pos  90.22    HI  Over 50       No

The dataset has too many observations for the models to train in a reasonable time, so we randomly retain 25% of the observations.

credit_data_prep_reduced <- credit_data_prep %>%
  slice_sample(prop = 0.25)
str(credit_data_prep_reduced)

## 'data.frame':    84901 obs. of  5 variables:
##  $ category: chr  "personal_care" "shopping_pos" "food_dining" "grocery_pos" ...
##  $ amt     : num  44.84 2.13 37.14 111.45 60.55 ...
##  $ state   : chr  "CA" "OR" "MO" "CA" ...
##  $ age_cat : chr  "Over 50" "Over 50" "Over 50" "Over 50" ...
##  $ is_fraud: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
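
Because fraud is rare, a simple random sample can shift the fraud rate by chance. One alternative, shown here as a sketch and not as the method used in this report, is to sample 25% within each class so the class ratio is preserved. The data frame `toy` and its 2% fraud rate are made up for illustration.

```r
set.seed(44)
# Toy data: 1,000 rows with a 2% fraud rate (made-up values)
toy <- data.frame(
  amt      = round(runif(1000, 1, 500), 2),
  is_fraud = factor(rep(c("No", "Yes"), times = c(980, 20)))
)

# Sample 25% of the row indices within each class separately
idx_by_class <- split(seq_len(nrow(toy)), toy$is_fraud)
keep <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.25 * length(idx)))]
}))
toy_reduced <- toy[keep, ]

# Class counts after sampling keep the original 98:2 ratio
table(toy_reduced$is_fraud)
```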

Split data into training and test

set.seed(44)
rows <- nrow(credit_data_prep_reduced)
credit_data_prep_reduced_shuffled <- credit_data_prep_reduced[sample(rows), ]
split <- round(0.8 * rows)
train <- credit_data_prep_reduced_shuffled[1:split, ]
test <- credit_data_prep_reduced_shuffled[(split + 1):rows, ]
nrow(train)

## [1] 67921

nrow(test)

## [1] 16980

nrow(credit_data_prep_reduced_shuffled)

## [1] 84901

install.packages('caret')

## 
## The downloaded binary packages are in
##  /var/folders/xk/11pvb5y144z3x4656f9bttr00000gn/T//RtmpqyRhbm/downloaded_packages

Create training control

DECISION TREE MODEL

Understanding the Decision Tree Structure

Transaction Amount: The model places significant emphasis on the transaction amount, suggesting that transactions of lower amounts are less likely to be fraudulent.

Category Importance: The categorygrocery_pos feature (the dummy variable for the grocery_pos category) is only examined for transactions with an amount below 272, indicating it’s a significant predictor of fraud within lower-amount transactions.

Age Factor: For very high amounts (1134 and above), age becomes a relevant factor, with younger cardholders’ transactions being more likely to be classified as fraudulent.

Model Certainty: The leaf nodes show the model’s certainty about its predictions. For instance, transactions that are not categorized as ‘grocery_pos’ and have amt < 272 are very likely not to be fraudulent.

library(caret)
myControl <- trainControl(
  method = "cv",
  number = 2,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)

str(train)

## 'data.frame':    67921 obs. of  5 variables:
##  $ category: chr  "gas_transport" "shopping_net" "kids_pets" "misc_pos" ...
##  $ amt     : num  45.02 78.48 42.55 1.89 6.68 ...
##  $ state   : chr  "AZ" "NE" "OR" "AZ" ...
##  $ age_cat : chr  "Over 50" "Under 50" "Under 50" "Over 50" ...
##  $ is_fraud: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

install.packages('rpart')

## 
## The downloaded binary packages are in
##  /var/folders/xk/11pvb5y144z3x4656f9bttr00000gn/T//RtmpqyRhbm/downloaded_packages

install.packages('rpart.plot')

## 
## The downloaded binary packages are in
##  /var/folders/xk/11pvb5y144z3x4656f9bttr00000gn/T//RtmpqyRhbm/downloaded_packages

set.seed(40)
rpart_model <- train(
  is_fraud ~ .,
  data = train,
  method = "rpart",
  trControl = myControl
)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

## + Fold1: cp=0.03179 
## - Fold1: cp=0.03179 
## + Fold2: cp=0.03179 
## - Fold2: cp=0.03179 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0318 on full training set

library(rpart.plot)
rpart.plot(rpart_model$finalModel, type = 2, fallen.leaves = TRUE)

rpart_model

## CART 
## 
## 67921 samples
##     4 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 33961, 33960 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec      
##   0.03179191  0.8655146  0.9993045  0.43930636
##   0.04046243  0.8651605  0.9995265  0.33236994
##   0.06840077  0.6809043  0.9999852  0.07514451
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03179191.

test$is_fraud_rpart <- predict(rpart_model, test)
table(test$is_fraud, test$is_fraud_rpart)

##      
##          No   Yes
##   No  16869    20
##   Yes    40    51

RANDOM FOREST MODEL

R Code and Model Training:

The train function from the caret package is used to train a Random Forest model on the training data to predict the is_fraud outcome. This function automates the process of model training including parameter tuning and cross-validation.
set.seed(40) ensures reproducibility of the results by setting the random number generator to a fixed point.
importance = “impurity” indicates that variable importance is assessed based on impurity measures which help in understanding which variables contribute most to the decision-making process.
trControl is a list of options that controls the training process, including cross-validation settings and the summary function used.

Plot and Results:

The plot illustrates the performance of the Random Forest model using two different splitting rules (gini and extratrees) across different numbers of randomly selected predictors at each split (mtry).
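The performance metric is the area under the ROC (Receiver Operating Characteristic) curve, which caret reports in the ROC column. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across classification thresholds; an area of 1.0 indicates perfect separation of the two classes.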
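
To make the metric concrete, the area under the ROC curve can be computed from ranks: it equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below uses made-up scores; `auc_rank` is a hypothetical helper, not part of caret.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, is_positive) {
  r  <- rank(score)          # average ranks handle tied scores
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up fraud scores for six transactions
score    <- c(0.05, 0.10, 0.20, 0.35, 0.60, 0.90)
is_fraud <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

auc_value <- auc_rank(score, is_fraud)
auc_value   # 8 of the 9 fraud/non-fraud pairs are ordered correctly
```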

set.seed(40)
rf_model <- train(
  is_fraud ~ .,
  data = train,
  method = "ranger",
  importance = "impurity",
  trControl = myControl
)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

## + Fold1: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold1: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold1: mtry=14, min.node.size=1, splitrule=gini 
## - Fold1: mtry=14, min.node.size=1, splitrule=gini 
## + Fold1: mtry=27, min.node.size=1, splitrule=gini 
## - Fold1: mtry=27, min.node.size=1, splitrule=gini 
## + Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=14, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=14, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=27, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=27, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry= 2, min.node.size=1, splitrule=gini 
## - Fold2: mtry= 2, min.node.size=1, splitrule=gini 
## + Fold2: mtry=14, min.node.size=1, splitrule=gini 
## - Fold2: mtry=14, min.node.size=1, splitrule=gini 
## + Fold2: mtry=27, min.node.size=1, splitrule=gini 
## - Fold2: mtry=27, min.node.size=1, splitrule=gini 
## + Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry= 2, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=14, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=14, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=27, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=27, min.node.size=1, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27, splitrule = extratrees, min.node.size = 1 on full training set

plot(rf_model)

rf_model$results

##   mtry min.node.size  splitrule       ROC      Sens      Spec        ROCSD
## 1    2             1       gini 0.9104731 1.0000000 0.0000000 0.0008478298
## 2    2             1 extratrees 0.8958925 1.0000000 0.0000000 0.0015622869
## 3   14             1       gini 0.9371629 0.9989789 0.5260116 0.0102468095
## 4   14             1 extratrees 0.9258183 0.9995265 0.3872832 0.0101794843
## 5   27             1       gini 0.9231629 0.9984610 0.5404624 0.0063236860
## 6   27             1 extratrees 0.9386366 0.9983722 0.5231214 0.0004512699
##         SensSD     SpecSD
## 1 0.000000e+00 0.00000000
## 2 0.000000e+00 0.00000000
## 3 1.883739e-04 0.05722251
## 4 4.184621e-05 0.05722251
## 5 8.368003e-05 0.02043661
## 6 2.092465e-04 0.06130984

#check on the tests
test$is_fraud_rf <- predict(rf_model, test)
table(test$is_fraud, test$is_fraud_rf)

##      
##          No   Yes
##   No  16864    25
##   Yes    35    56

Interpreting the Plot:

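The x-axis shows the number of variables randomly sampled as candidates at each split (mtry); the values tried were 2, 14, and 27. The count exceeds the four original predictors because the formula interface expands the categorical variables into dummy variables.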
The y-axis shows the ROC score obtained from cross-validation. ROC scores range from 0.5 (no better than random chance) to 1.0 (perfect classification).
There are two lines representing the two splitting rules used: gini and extratrees. The Gini impurity measure is a traditional choice that measures how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. extratrees is a variation of the usual Random Forest algorithm that makes the trees more random by also using random thresholds for each feature rather than the best split, among other changes.
The plot shows that the two splitting rules respond differently to mtry. With gini, the ROC rises from 0.910 at mtry = 2 to 0.937 at mtry = 14 and then falls to 0.923 at mtry = 27. With extratrees, it keeps rising, from 0.896 to 0.926 to 0.939. At mtry = 2, both rules classify every transaction as non-fraud (Spec = 0), so very small values of mtry are unsuitable here.
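
The Gini impurity described above can be written in a few lines. This is an illustrative sketch; `gini_impurity` is a hypothetical helper and not how ranger computes it internally.

```r
# Gini impurity of a node: 1 minus the sum of squared class proportions
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)
}

pure_node  <- gini_impurity(c("No", "No", "No", "No"))     # single class
mixed_node <- gini_impurity(c("No", "No", "Yes", "Yes"))   # evenly mixed

c(pure = pure_node, mixed = mixed_node)
```

A split is attractive under the gini rule when the child nodes have lower impurity than the parent.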

Model Selection and Performance:

The selected model uses mtry = 27 with the extratrees split rule, which means it considers all 27 dummy-encoded predictors at each split and draws split thresholds at random rather than searching for the best one.
The cross-validated ROC score for the selected model is approximately 0.939, indicating a strong predictive ability.
caret treats the first factor level (“No”) as the positive class, so Sens (0.998) is the share of non-fraudulent transactions classified correctly and Spec (0.523) is the share of fraudulent transactions detected. The model is therefore much better at recognising legitimate transactions than at catching fraud.

Confusion Matrix:

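The confusion matrix on the test set (rows are actual classes, columns are predicted classes) shows the number of correct and incorrect predictions:
No (Non-Fraud): 16864 correctly classified as legitimate, 25 incorrectly flagged as fraud (false alarms).
Yes (Fraud): 35 fraudulent transactions missed, 56 correctly detected.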

This matrix can be used to calculate various performance metrics, including accuracy, precision, recall, and F1-score.
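
As a worked example, the metrics can be computed from the test-set table printed by `table(test$is_fraud, test$is_fraud_rf)` above. The sketch treats fraud (“Yes”) as the class of interest, which is a choice made here for illustration.

```r
# Test-set confusion matrix of the Random Forest model
# (rows = actual class, columns = predicted class)
cm <- matrix(c(16864, 25,
               35,    56),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

tp <- cm["Yes", "Yes"]; fn <- cm["Yes", "No"]
fp <- cm["No", "Yes"];  tn <- cm["No", "No"]

metrics <- c(
  accuracy  = (tp + tn) / sum(cm),
  precision = tp / (tp + fp),        # flagged transactions that are truly fraud
  recall    = tp / (tp + fn),        # fraudulent transactions that are caught
  f1        = 2 * tp / (2 * tp + fp + fn)
)
round(metrics, 3)
```

Accuracy is dominated by the large non-fraud class, so precision, recall, and F1 are more informative for this problem.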

Summary:

The Random Forest model seems to perform well, with high sensitivity and a good ROC score, indicating its effectiveness in classifying transactions as fraudulent or not.
The tuning of the mtry parameter shows that more predictors are not always better, and a middle ground yields the best cross-validation performance.
While the model has a high sensitivity, the number of false negatives (fraudulent transactions missed by the model) should be reduced as much as possible, considering the context of fraud detection.

K NEAREST NEIGHBOUR MODEL

How KNN Works:

Distance Measurement: KNN calculates the distance between the test observation and all observations in the training set. Common distance measures are Euclidean, Manhattan, or Hamming distance.
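
The distance step can be sketched in base R as follows. This is a simplified illustration with Euclidean distance and majority voting; `knn_predict` is a hypothetical helper, the feature values are made up and assumed to be already scaled, and caret's own implementation may differ in details such as tie handling.

```r
# Classify one query point by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up, already-scaled features for five training transactions
train_x <- matrix(c(0.10, 0.20,
                    0.20, 0.10,
                    0.90, 0.80,
                    0.80, 0.90,
                    0.85, 0.95),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("No", "No", "Yes", "Yes", "Yes"))

pred_low  <- knn_predict(train_x, train_y, query = c(0.15, 0.15), k = 3)
pred_high <- knn_predict(train_x, train_y, query = c(0.90, 0.90), k = 3)
c(pred_low, pred_high)
```

Because distances depend on the scale of each feature, unscaled variables such as the transaction amount can dominate the result.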

set.seed(40)
knn_model <- train(
  is_fraud ~ .,
  data = train,
  method = "knn",
  trControl = myControl
)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

## + Fold1: k=5 
## - Fold1: k=5 
## + Fold1: k=7 
## - Fold1: k=7 
## + Fold1: k=9 
## - Fold1: k=9 
## + Fold2: k=5 
## - Fold2: k=5 
## + Fold2: k=7 
## - Fold2: k=7 
## + Fold2: k=9 
## - Fold2: k=9 
## Aggregating results
## Selecting tuning parameters
## Fitting k = 9 on full training set

#KNN Results
knn_model$results

##   k       ROC      Sens      Spec       ROCSD       SensSD      SpecSD
## 1 5 0.8411824 0.9988013 0.2225434 0.014134239 2.090297e-05 0.012261967
## 2 7 0.8563011 0.9988457 0.2196532 0.007908128 4.188027e-05 0.008174645
## 3 9 0.8657195 0.9989197 0.1994220 0.006256852 2.302313e-04 0.004087322

test$is_fraud_knn <- predict(knn_model, test)
table(test$is_fraud, test$is_fraud_knn)

##      
##          No   Yes
##   No  16868    21
##   Yes    68    23

Interpretation of Warnings and Output:

The warning arises because caret’s default metric, “Accuracy”, is not among the values returned by twoClassSummary (ROC, Sens, Spec); ROC, the area under the ROC curve, is used as the performance metric instead.
The process includes cross-validation, which is denoted by “+ Fold1: k=5” and similar lines, indicating that the model was trained and evaluated with different ‘k’ values (5, 7, 9) on different subsets of the data (folds).
The final model was fitted with k = 9 on the full training set because k = 9 gave the highest mean ROC value (0.866) during cross-validation.

Confusion Matrix:

The test-set confusion matrix for k = 9 shows 16868 legitimate transactions classified correctly and 21 false alarms. However, 68 fraudulent transactions were missed and only 23 were detected, so while the model is good at identifying genuine transactions, it struggles to identify fraudulent ones.

Data Frame Results:

Each row in the resulting data frame corresponds to a different ‘k’ value used during the training.
ROC is the average ROC score from cross-validation.
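Sens and Spec are the sensitivity and specificity. Because caret treats the first factor level (“No”) as the positive class, Sens is the share of non-fraudulent transactions classified correctly and Spec is the share of fraudulent transactions detected.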
ROCSD, SensSD, and SpecSD are the standard deviations of the ROC, sensitivity, and specificity scores across the cross-validation folds.
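
The Sens and Spec columns are easy to misread, because the reported values are consistent with caret treating the first factor level (“No”) as the positive class. The sketch below recomputes both quantities under that convention from the KNN test-set table printed above; the test-set values differ somewhat from the cross-validated ones.

```r
# Test-set confusion matrix of the KNN model
# (rows = actual class, columns = predicted class)
cm <- matrix(c(16868, 21,
               68,    23),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes")))

# With "No" as the positive class:
sens_no  <- cm["No", "No"]   / sum(cm["No", ])    # legitimate transactions recognised
spec_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # fraudulent transactions caught

round(c(Sens = sens_no, Spec = spec_yes), 4)
```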

Summary:

The KNN model with k = 9 has the highest ROC value, approximately 0.866, but it detects few fraudulent transactions: its cross-validated Spec, which here measures the fraud detection rate, is only 0.199.
Spec decreases as ‘k’ increases (0.223, 0.220, 0.199), which suggests that the model becomes more conservative with a larger ‘k’. The model with k = 9 is, therefore, less likely to predict fraud, perhaps due to the imbalanced nature of the dataset where non-fraudulent cases are much more common than fraudulent ones.
The low fraud detection rate is a significant issue because it means the model misses most fraudulent transactions. It could be addressed by using different features, resampling techniques to balance the classes, or a different threshold for classifying an observation as fraud.
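
The effect of changing the classification threshold can be illustrated on made-up probabilities. This is a sketch only: the vectors `prob_fraud` and `actual` are hypothetical, and with a caret model the probabilities would come from `predict(model, newdata, type = "prob")`.

```r
# Made-up predicted fraud probabilities and true labels for ten transactions
prob_fraud <- c(0.02, 0.05, 0.10, 0.15, 0.30, 0.35, 0.45, 0.55, 0.70, 0.90)
actual     <- c("No", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "Yes")

# Share of fraudulent transactions flagged at a given threshold
recall_at <- function(threshold) {
  predicted <- ifelse(prob_fraud >= threshold, "Yes", "No")
  sum(predicted == "Yes" & actual == "Yes") / sum(actual == "Yes")
}

# Number of legitimate transactions flagged at a given threshold
false_alarms_at <- function(threshold) {
  sum(prob_fraud >= threshold & actual == "No")
}

c(recall_0.50 = recall_at(0.50), recall_0.25 = recall_at(0.25))
c(alarms_0.50 = false_alarms_at(0.50), alarms_0.25 = false_alarms_at(0.25))
```

Lowering the threshold catches more fraud at the cost of more false alarms, so the choice depends on the relative cost of the two kinds of error.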

Compare models

model_list <- list(
  rpart = rpart_model,
  rf = rf_model,
  knn = knn_model
)
resamps <- resamples(model_list)
summary(resamps)

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: rpart, rf, knn 
## Number of resamples: 2 
## 
## ROC 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.8623982 0.8639564 0.8655146 0.8655146 0.8670728 0.8686309    0
## rf    0.9383175 0.9384771 0.9386366 0.9386366 0.9387962 0.9389557    0
## knn   0.8612953 0.8635074 0.8657195 0.8657195 0.8679317 0.8701438    0
## 
## Sens 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.9992601 0.9992823 0.9993045 0.9993045 0.9993267 0.9993489    0
## rf    0.9982242 0.9982982 0.9983722 0.9983722 0.9984462 0.9985201    0
## knn   0.9987569 0.9988383 0.9989197 0.9989197 0.9990011 0.9990825    0
## 
## Spec 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart 0.3641618 0.4017341 0.4393064 0.4393064 0.4768786 0.5144509    0
## rf    0.4797688 0.5014451 0.5231214 0.5231214 0.5447977 0.5664740    0
## knn   0.1965318 0.1979769 0.1994220 0.1994220 0.2008671 0.2023121    0

dotplot(resamps, metric = "Sens")

dotplot(resamps, metric = "Spec")

ROC (Receiver Operating Characteristic):

Random Forest has the highest ROC on every summary statistic, scoring about 0.938 to 0.939 across the two resamples. This indicates a superior ability to distinguish between fraudulent and non-fraudulent transactions.
Decision Tree (rpart) has a mean ROC of about 0.866, with scores ranging from about 0.862 to 0.869, well below Random Forest.
K-Nearest Neighbors (knn) is effectively tied with the Decision Tree, with a mean ROC of about 0.866 and a range of about 0.861 to 0.870, so it is also less capable of distinguishing between the two classes than Random Forest.

Sensitivity (True Positive Rate):

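All models exhibit very high sensitivity, above 0.998. The Decision Tree is marginally highest (mean 0.9993), followed by K-Nearest Neighbors (0.9989) and Random Forest (0.9984), so the differences are negligible. Note that caret's twoClassSummary reports Sens for the first level of the outcome factor. If that level is the non-fraud class, Sens here measures how well legitimate transactions are recognised and Spec measures the share of frauds caught, so the level order should be confirmed before reading these values as fraud detection rates.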

Specificity (True Negative Rate):

There is a more pronounced difference in specificity. Random Forest again has the highest specificity (mean about 0.52), indicating it is best at correctly identifying legitimate transactions.
Decision Tree has moderate specificity (mean about 0.44), which means more legitimate transactions may be incorrectly flagged as fraud than with the Random Forest model.
K-Nearest Neighbors has very low specificity (mean about 0.20). This means it is likely to generate many false positives, flagging legitimate transactions as fraudulent, which could lead to increased operational costs and customer dissatisfaction.
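Because sensitivity and specificity swap meaning depending on which class is treated as positive, it helps to see both readings side by side. The sketch below uses hypothetical confusion counts, not the project's results. Which level comes first in this project's outcome factor is not shown in this section, so both orderings are given.

```r
# Hypothetical confusion counts for illustration only.
conf <- matrix(c(9990, 10,    # actual legit: predicted legit, predicted fraud
                 60,   40),   # actual fraud: predicted legit, predicted fraud
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("legit", "fraud"),
                               predicted = c("legit", "fraud")))

# Recall of each class: share of that class's rows predicted correctly.
recall_legit <- conf["legit", "legit"] / sum(conf["legit", ])
recall_fraud <- conf["fraud", "fraud"] / sum(conf["fraud", ])

# If "legit" is the positive (first) level: Sens = recall_legit, Spec = recall_fraud.
legit_positive <- c(Sens = recall_legit, Spec = recall_fraud)
# If "fraud" is the positive (first) level, the two labels swap.
fraud_positive <- c(Sens = recall_fraud, Spec = recall_legit)

print(rbind(legit_positive, fraud_positive))
```

The same predictions give a near-perfect Sens under one ordering and a modest Sens under the other, which is why checking `levels()` of the outcome column matters when interpreting the comparison table.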

Best Model Selection:

The Random Forest model is chosen as the best model. It has the highest ROC and the highest specificity, and its sensitivity is only marginally below that of the other two models (all three exceed 0.998).

The best model is the one that balances catching frauds (high sensitivity) against not disturbing genuine transactions (high specificity). The Random Forest model provides this balance better than the other models tested.
Implementing the Random Forest model in a live system could reduce the manual review workload through fewer false positives and enhance customer trust through accurately flagged transactions.
The Random Forest model’s robustness to different types of data and its ability to handle a large number of input variables make it a versatile and reliable choice for fraud detection.
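One simple way to make the "balance" criterion explicit is to average sensitivity and specificity (balanced accuracy) for each model. The sketch below uses the mean values from the comparison output above. Balanced accuracy is an illustrative choice of summary here, not a metric the project itself reports, and the variable names are hypothetical.

```r
# Mean resampled metrics copied from the model comparison output.
results <- data.frame(
  model = c("rpart", "rf", "knn"),
  roc   = c(0.8655146, 0.9386366, 0.8657195),
  sens  = c(0.9993045, 0.9983722, 0.9989197),
  spec  = c(0.4393064, 0.5231214, 0.1994220),
  stringsAsFactors = FALSE
)

# Balanced accuracy: the simple average of sensitivity and specificity.
results$balanced_accuracy <- (results$sens + results$spec) / 2

# Rank models from best to worst on the balanced measure.
ranked <- results[order(-results$balanced_accuracy), ]
print(ranked)
```

On these figures the Random Forest ranks first on both ROC and balanced accuracy, consistent with the selection above. The estimates rest on only two resamples.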

6. Data Product

The data product is a credit card fraud detection tool built on machine learning and hosted on Streamlit. It offers an interactive, user-friendly platform for identifying and preventing fraudulent credit card transactions, analyzing transaction data in real time to pinpoint patterns that indicate fraud. Its web interface lets users interact with the model directly.

Demo Data Product : Streamlit - Credit Card Prediction

7. Conclusion

Overall, every part of the project carried significant weight and contributed to the final outcome, from data pre-processing and cleaning through visualisation and modelling to the final data product. Using R, its packages, and its machine learning models, our visuals and models have answered the questions on credit card fraud prediction set out at the beginning of the project. With continued refinement and re-evaluation of the process and the models, a more accurate prediction model could be achieved in the future.

Group_5_Data_Wizards_WQD7004

2024-01-11
