Dataset used: https://www.kaggle.com/datasets/kartik2112/fraud-detection
According to datasets published by the central bank of Malaysia, credit card adoption is growing. Even in the age of digital payments, credit card usage in Malaysia remains substantial, covering over 177 billion ringgit in 2022 alone.
This growth brings threats as well as opportunities, chiefly from fraudsters. A market of this size is an attractive target for credit card fraud. The data on fraud is clear: in the first 7 months of 2022 alone, Malaysia lost 415 million ringgit to fraud, contributing to the billions of dollars lost globally every year.
As a result, my team and I intend to contribute to fraud identification and detection through data science methods by building a prediction model on credit card transaction data. The training data was generated with Sparkov, a data generation tool that simulates legitimate and fraudulent credit card transactions in the US.
This project explores whether a reliable model can detect fraud before it is too late to intervene. Such a model could benefit authorities, card scheme providers such as Visa, MasterCard, and EuroNet, and banks by making the credit card system safer, more reliable, and more trusted.
It also helps credit card users avoid situations in which their card is used fraudulently, which can lead to deteriorating credit scores and debt they cannot afford. In addition, a more secure, fraud-protected credit card environment increases users’ confidence in the system and strengthens the payment services landscape. Finally, such a model could reduce the number of fraud incidents, which can indirectly help tackle the proliferation of funds and terrorist financing, a major concern in the financial sector.
Credit card fraud happens when someone makes unauthorized transactions using another person’s credit card. Credit card usage has increased over the years with the growth of online transactions and cashless payments. Credit card fraud has consequences such as financial loss, identity theft, and inconvenience for cardholders.
To identify and understand patterns associated with fraud: analyze credit card transactions using advanced techniques to distinguish the characteristics of legitimate transactions from those of potentially fraudulent ones.
To evaluate the performance and reliability of the fraud prediction model: assess its robustness through rigorous evaluation and validation, using performance metrics and cross-validation on diverse datasets to check real-world applicability.
To develop an adaptive fraud detection model: create a machine learning system that adapts to evolving credit card fraud patterns, performs well on historical data, and identifies new tactics for a proactive defense.
These questions focus on understanding and analyzing the broader context and trends within the dataset, which can be crucial for developing effective strategies to prevent and manage credit card fraud.
Fraud Pattern Identification: What are the common characteristics of fraudulent transactions? This includes understanding the time, location, amount, and frequency of such transactions.
Transaction Analysis: How do the properties of fraudulent transactions differ from those of legitimate ones?
Geographic Analysis: Are there specific regions or locations where fraud is more prevalent?
Temporal Patterns: At what times do fraud incidents peak? Are there patterns related to days of the week, months, or specific events/holidays?
Impact Assessment: What is the financial impact of fraud on both the customers and the institution?
Load the dataset and required packages
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages('tidyverse')
suppressPackageStartupMessages(library(tidyverse))
credit_data <- read.csv('credit_card_fraud.csv')
install.packages("dplyr")
library(dplyr)
library(ggplot2)
install.packages("ggthemes")
library(ggthemes)
library(lubridate)
install.packages("ggmap")
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
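# Read the Google Maps API key from an environment variable instead of hardcoding it
# (the variable name GOOGLE_MAPS_API_KEY is an assumption; set it in .Renviron)
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))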
library(grid)
library(broom)
install.packages(c("glmnet", "rpart", "rpart.plot", "ranger", "caret", "caretEnsemble"))
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(rpart)
library(rpart.plot)
library(ranger)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(caretEnsemble)
##
## Attaching package: 'caretEnsemble'
##
## The following object is masked from 'package:ggplot2':
##
## autoplot
install.packages("mlbench")
library(mlbench)
library(class)
Data pre-processing is one of the most important steps in getting raw data ready for analysis or machine learning models. The R script for this project carries out several pre-processing tasks, beginning with installing packages and loading libraries.
We then imported the dataset, checked the variable types, and examined summary statistics. After confirming that there were no missing values, we explored the distribution of numeric variables using histograms and log transformations.
str(credit_data) #Most columns have a suitable type; trans_date_trans_time and dob are read as character and are converted during data cleaning
## 'data.frame': 339607 obs. of 15 variables:
## $ trans_date_trans_time: chr "2019-01-01 00:00:44" "2019-01-01 00:00:51" "2019-01-01 00:07:27" "2019-01-01 00:09:03" ...
## $ merchant : chr "Heller, Gutmann and Zieme" "Lind-Buckridge" "Kiehn Inc" "Beier-Hyatt" ...
## $ category : chr "grocery_pos" "entertainment" "grocery_pos" "shopping_pos" ...
## $ amt : num 107.23 220.11 96.29 7.77 6.85 ...
## $ city : chr "Orient" "Malad City" "Grenada" "High Rolls Mountain Park" ...
## $ state : chr "WA" "ID" "CA" "NM" ...
## $ lat : num 48.9 42.2 41.6 32.9 43 ...
## $ long : num -118 -112 -123 -106 -111 ...
## $ city_pop : int 149 4154 589 899 471 4878 4005 597 46 85 ...
## $ job : chr "Special educational needs teacher" "Nature conservation officer" "Systems analyst" "Naval architect" ...
## $ dob : chr "1978-06-21" "1962-01-19" "1945-12-21" "1967-08-30" ...
## $ trans_num : chr "1f76529f8574734946361c461b024d99" "a1a22d70485983eac12b5b88dad1cf95" "413636e759663f264aae1819a4d4f231" "8a6293af5ed278dea14448ded2685fea" ...
## $ merch_lat : num 49.2 43.2 41.7 32.9 43.8 ...
## $ merch_long : num -118 -112 -122 -107 -111 ...
## $ is_fraud : int 0 0 0 0 0 0 0 0 0 0 ...
summary(credit_data)
## trans_date_trans_time merchant category amt
## Length:339607 Length:339607 Length:339607 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 9.60
## Mode :character Mode :character Mode :character Median : 46.46
## Mean : 70.58
## 3rd Qu.: 83.35
## Max. :28948.90
## city state lat long
## Length:339607 Length:339607 Min. :20.03 Min. :-165.67
## Class :character Class :character 1st Qu.:36.72 1st Qu.:-120.09
## Mode :character Mode :character Median :39.62 Median :-111.10
## Mean :39.72 Mean :-110.62
## 3rd Qu.:41.71 3rd Qu.:-100.62
## Max. :66.69 Max. : -89.63
## city_pop job dob trans_num
## Min. : 46 Length:339607 Length:339607 Length:339607
## 1st Qu.: 471 Class :character Class :character Class :character
## Median : 1645 Mode :character Mode :character Mode :character
## Mean : 107141
## 3rd Qu.: 35439
## Max. :2383912
## merch_lat merch_long is_fraud
## Min. :19.03 Min. :-166.67 Min. :0.000000
## 1st Qu.:36.82 1st Qu.:-119.82 1st Qu.:0.000000
## Median :39.59 Median :-111.04 Median :0.000000
## Mean :39.72 Mean :-110.62 Mean :0.005247
## 3rd Qu.:42.19 3rd Qu.:-100.35 3rd Qu.:0.000000
## Max. :67.51 Max. : -88.63 Max. :1.000000
#No missing data
#Suggestion of outliers in amt and city_pop but this is to be expected. We can take a look anyway at the distribution of those variables.
hist(credit_data$amt) #Let's see the log transformation
hist(log(credit_data$amt)) #Acceptable
The ‘is_fraud’ variable was converted to a factor with custom levels (“No” and “Yes”). Finally, we analyzed the categorical variables, focusing on the distribution of the ‘category’ variable and its connection to fraud through frequency tables.
#Convert is_fraud into a factor with the labels no and yes
credit_data$is_fraud <- factor(credit_data$is_fraud)
levels(credit_data$is_fraud) <- c("No", "Yes")
table(credit_data$category) #The suffixes _net and _pos are not documented; they presumably mean online and point-of-sale transactions
##
## entertainment food_dining gas_transport grocery_net grocery_pos
## 24222 23038 35089 11355 32732
## health_fitness home kids_pets misc_net misc_pos
## 22593 32516 29704 16898 20024
## personal_care shopping_net shopping_pos travel
## 24406 26379 30329 10322
table(credit_data$category, credit_data$is_fraud) #Yes grocery_pos has the highest number of frauds but we need to look at number of fraud per transaction.
##
## No Yes
## entertainment 24167 55
## food_dining 23000 38
## gas_transport 34936 153
## grocery_net 11328 27
## grocery_pos 32299 433
## health_fitness 22557 36
## home 32466 50
## kids_pets 29649 55
## misc_net 16681 217
## misc_pos 19962 62
## personal_care 24351 55
## shopping_net 25998 381
## shopping_pos 30142 187
## travel 10289 33
In general, data pre-processing includes operations like handling missing values, converting types, and changing variables to produce a standardized and clean dataset.
With EDA, our task is to understand the dataset.
First, we check for missing values to confirm that no data is absent.
# check for missing values in the dataset
sum(is.na(credit_data))
## [1] 0
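A single total can hide where any gaps sit, and `is.na()` does not flag empty strings in character columns. The sketch below shows a per-column version of the check on a small made-up data frame (`toy` is hypothetical and is not the project data).

```r
# Hypothetical toy data frame with one NA per column type and one empty string
toy <- data.frame(
  amt      = c(10.5, NA, 3.2),
  category = c("travel", "", NA),
  is_fraud = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Count missing values per column rather than for the whole table
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Count empty strings per column; these are not reported by is.na()
blank_per_column <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))
print(blank_per_column)
```

Running the same two checks on `credit_data` would show whether any character column contains blanks that the overall NA count misses.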
Next, we list the column names to see which details of each financial transaction are recorded.
# display dataset columns
names(credit_data)
## [1] "trans_date_trans_time" "merchant" "category"
## [4] "amt" "city" "state"
## [7] "lat" "long" "city_pop"
## [10] "job" "dob" "trans_num"
## [13] "merch_lat" "merch_long" "is_fraud"
We then used the head and tail functions to view the first and last few rows of the dataset.
# explore the first and last few rows of dataset
head(credit_data)
## trans_date_trans_time merchant category amt
## 1 2019-01-01 00:00:44 Heller, Gutmann and Zieme grocery_pos 107.23
## 2 2019-01-01 00:00:51 Lind-Buckridge entertainment 220.11
## 3 2019-01-01 00:07:27 Kiehn Inc grocery_pos 96.29
## 4 2019-01-01 00:09:03 Beier-Hyatt shopping_pos 7.77
## 5 2019-01-01 00:21:32 Bruen-Yost misc_pos 6.85
## 6 2019-01-01 00:22:06 Kunze Inc grocery_pos 90.22
## city state lat long city_pop
## 1 Orient WA 48.8878 -118.2105 149
## 2 Malad City ID 42.1808 -112.2620 4154
## 3 Grenada CA 41.6125 -122.5258 589
## 4 High Rolls Mountain Park NM 32.9396 -105.8189 899
## 5 Freedom WY 43.0172 -111.0292 471
## 6 Honokaa HI 20.0827 -155.4880 4878
## job dob trans_num
## 1 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99
## 2 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95
## 3 Systems analyst 1945-12-21 413636e759663f264aae1819a4d4f231
## 4 Naval architect 1967-08-30 8a6293af5ed278dea14448ded2685fea
## 5 Education officer, museum 1967-08-02 f3c43d336e92a44fc2fb67058d5949e3
## 6 Physiotherapist 1966-12-03 95826e3caa9e0b905294c6dae985aec1
## merch_lat merch_long is_fraud
## 1 49.15905 -118.1865 No
## 2 43.15070 -112.1545 No
## 3 41.65752 -122.2303 No
## 4 32.86326 -106.5202 No
## 5 43.75373 -111.4549 No
## 6 19.56001 -156.0459 No
tail(credit_data)
## trans_date_trans_time merchant category
## 339602 2020-12-31 23:57:18 Larkin, Stracke and Greenfelde entertainment
## 339603 2020-12-31 23:57:56 Schmidt-Larkin home
## 339604 2020-12-31 23:58:04 Pouros, Walker and Spence kids_pets
## 339605 2020-12-31 23:59:07 Reilly and Sons health_fitness
## 339606 2020-12-31 23:59:15 Rau-Robel kids_pets
## 339607 2020-12-31 23:59:24 Breitenberg LLC travel
## amt city state lat long city_pop
## 339602 46.71 Blairsden-Graeagle CA 39.8127 -120.6405 1725
## 339603 12.68 Wales AK 64.7556 -165.6723 145
## 339604 13.02 Greenview CA 41.5403 -122.9366 308
## 339605 43.77 Luray MO 40.4931 -91.8912 519
## 339606 86.88 Burbank WA 46.1966 -118.9017 3684
## 339607 7.99 Mesa ID 44.6255 -116.4493 129
## job dob
## 339602 Chartered legal executive (England and Wales) 1967-05-27
## 339603 Administrator, education 1939-11-09
## 339604 Call centre manager 1958-09-20
## 339605 Town planner 1966-02-13
## 339606 Musician 1981-11-29
## 339607 Cartographer 1965-12-15
## trans_num merch_lat merch_long is_fraud
## 339602 a7105564935ea3977dc61ff9ced3bf5e 38.96354 -120.45712 No
## 339603 a8310343c189e4a5b6316050d2d6b014 65.62359 -165.18603 No
## 339604 bd7071fd5c9510a5594ee196368ac80e 41.97313 -123.55303 No
## 339605 9b1f753c79894c9f4b71f04581835ada 39.94684 -91.33333 No
## 339606 6c5b7c8add471975aa0fec023b2e8408 46.65834 -119.71505 No
## 339607 14392d723bb7737606b2700ac791b7aa 44.47053 -117.08089 No
Next, to find out which spending categories appear in the transactions, we listed the unique values of the category column.
# explore unique values of category column
unique_category <- unique(credit_data$category)
print(unique_category)
## [1] "grocery_pos" "entertainment" "shopping_pos" "misc_pos"
## [5] "shopping_net" "gas_transport" "misc_net" "grocery_net"
## [9] "food_dining" "health_fitness" "kids_pets" "home"
## [13] "personal_care" "travel"
The summary function provides numerical details such as the minimum, median, mean, and maximum values.
# statistical summary
summary(credit_data)
## trans_date_trans_time merchant category amt
## Length:339607 Length:339607 Length:339607 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 9.60
## Mode :character Mode :character Mode :character Median : 46.46
## Mean : 70.58
## 3rd Qu.: 83.35
## Max. :28948.90
## city state lat long
## Length:339607 Length:339607 Min. :20.03 Min. :-165.67
## Class :character Class :character 1st Qu.:36.72 1st Qu.:-120.09
## Mode :character Mode :character Median :39.62 Median :-111.10
## Mean :39.72 Mean :-110.62
## 3rd Qu.:41.71 3rd Qu.:-100.62
## Max. :66.69 Max. : -89.63
## city_pop job dob trans_num
## Min. : 46 Length:339607 Length:339607 Length:339607
## 1st Qu.: 471 Class :character Class :character Class :character
## Median : 1645 Mode :character Mode :character Mode :character
## Mean : 107141
## 3rd Qu.: 35439
## Max. :2383912
## merch_lat merch_long is_fraud
## Min. :19.03 Min. :-166.67 No :337825
## 1st Qu.:36.82 1st Qu.:-119.82 Yes: 1782
## Median :39.59 Median :-111.04
## Mean :39.72 Mean :-110.62
## 3rd Qu.:42.19 3rd Qu.:-100.35
## Max. :67.51 Max. : -88.63
Next, we turn to the fraudulent transactions to get an overall impression.
# table view of fraud summary (No: legitimate transaction, Yes: fraudulent transaction)
fraud_summary_table<-table(credit_data$is_fraud)
print(fraud_summary_table)
##
## No Yes
## 337825 1782
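The counts above imply a heavily imbalanced target, which matters later when judging model accuracy. As a rough sketch, the class shares and the accuracy of a trivial "always No" classifier can be computed from the printed counts (the variable names here are illustrative).

```r
# Counts copied from the table printed above
fraud_counts <- c(No = 337825, Yes = 1782)

# Share of each class as a percentage of all transactions
fraud_share <- round(100 * prop.table(fraud_counts), 3)
print(fraud_share)

# Accuracy of a trivial model that always predicts "No"
baseline_accuracy <- max(fraud_counts) / sum(fraud_counts)
print(round(baseline_accuracy, 4))
```

Because the baseline is already above 99%, accuracy alone says little here; metrics such as recall and precision on the "Yes" class are more informative.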
To look for further fraud patterns, we summarize transactions by state and by job.
# table view of fraud summary by state
fraud_state_table <-table(credit_data$state,credit_data$is_fraud)
print(fraud_state_table)
##
## No Yes
## AK 2913 50
## AZ 15298 64
## CA 80093 402
## CO 19651 115
## HI 3633 16
## ID 8002 33
## MO 54642 262
## NE 34209 216
## NM 23306 121
## OR 26211 197
## UT 15296 61
## WA 26914 126
## WY 27657 119
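Raw counts favor states with many transactions, so a per-state fraud rate is easier to compare. The sketch below recomputes the rate from the counts printed above; the data frame and column names are illustrative.

```r
# Counts copied from the state table printed above
state_counts <- data.frame(
  state = c("AK", "AZ", "CA", "CO", "HI", "ID", "MO",
            "NE", "NM", "OR", "UT", "WA", "WY"),
  legit = c(2913, 15298, 80093, 19651, 3633, 8002, 54642,
            34209, 23306, 26211, 15296, 26914, 27657),
  fraud = c(50, 64, 402, 115, 16, 33, 262,
            216, 121, 197, 61, 126, 119)
)

# Fraud as a percentage of all transactions in each state
state_counts$perc_fraud <- round(
  100 * state_counts$fraud / (state_counts$legit + state_counts$fraud), 2
)

# Sort from highest to lowest fraud rate
state_rates <- state_counts[order(-state_counts$perc_fraud), ]
print(state_rates)
```

On these numbers California has the most fraudulent transactions by count, while Alaska has the highest rate, which is why both views are worth checking.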
# table view of fraud summary by job
fraud_job_table <-table(credit_data$job,credit_data$is_fraud)
print(fraud_job_table)
##
## No Yes
## Accountant, chartered 0 11
## Administrator, education 2188 15
## Administrator, local government 1457 9
## Advertising account planner 3639 9
## Aeronautical engineer 2187 14
## Agricultural consultant 2190 5
## Airline pilot 2910 6
## Architect 2910 15
## Architectural technologist 1458 8
## Armed forces training and education officer 2914 5
## Associate Professor 729 5
## Barista 2184 9
## Barrister 2909 8
## Building surveyor 1457 12
## Buyer, industrial 2181 11
## Call centre manager 2187 7
## Careers information officer 0 12
## Cartographer 2910 11
## Charity officer 728 12
## Chartered legal executive (England and Wales) 1455 9
## Chartered public finance accountant 2190 0
## Chemical engineer 729 11
## Chief Marketing Officer 1457 6
## Chiropodist 730 13
## Civil engineer, contracting 2910 19
## Civil Service administrator 727 12
## Civil Service fast streamer 728 11
## Clinical cytogeneticist 0 7
## Clinical research associate 728 3
## Clothing/textile technologist 3631 15
## Colour technologist 2187 29
## Commissioning editor 0 14
## Community arts worker 1454 7
## Comptroller 727 11
## Contractor 4364 2
## Counselling psychologist 2916 15
## Counsellor 2915 9
## Cytogeneticist 2187 5
## Dealer 3645 19
## Designer, exhibition/display 4374 4
## Development worker, international aid 0 10
## Early years teacher 2915 8
## Economist 726 11
## Editor, magazine features 2179 8
## Education administrator 1457 10
## Education officer, museum 2912 15
## Educational psychologist 3643 18
## Electronics engineer 4369 11
## Engineer, agricultural 2186 9
## Engineer, automotive 2911 8
## Engineer, biomedical 1460 22
## Engineer, building services 1458 12
## Engineer, civil (consulting) 725 9
## Engineer, communications 2190 0
## Engineer, electronics 726 7
## Engineer, maintenance 2184 14
## Engineer, petroleum 1452 13
## Engineer, production 2185 11
## Engineer, site 0 12
## Exercise physiologist 730 4
## Fine artist 0 10
## Firefighter 3643 7
## Forensic psychologist 729 12
## Freight forwarder 2908 12
## Further education lecturer 1454 10
## Futures trader 1454 8
## Geologist, engineering 1457 9
## Geoscientist 4371 18
## Glass blower/designer 729 13
## Health physicist 4371 3
## Health service manager 2187 6
## Historic buildings inspector/conservation officer 3641 12
## Hotel manager 728 9
## Human resources officer 2909 23
## Immigration officer 2914 11
## Industrial/product designer 0 11
## Information officer 0 8
## Information systems manager 2187 9
## Insurance broker 5093 15
## Intelligence analyst 3636 5
## Investment analyst 3644 10
## Investment banker, corporate 1454 11
## IT consultant 1454 8
## Journalist, newspaper 728 9
## Land/geomatics surveyor 5099 20
## Landscape architect 0 9
## Learning mentor 2183 12
## Lecturer, higher education 2915 7
## Licensed conveyancer 2188 10
## Local government officer 728 4
## Location manager 2184 9
## Magazine features editor 2181 8
## Marketing executive 726 10
## Materials engineer 1459 19
## Medical technical officer 726 14
## Mental health nurse 2187 8
## Metallurgist 1459 13
## Museum education officer 1459 13
## Museum/gallery exhibitions officer 2185 8
## Music therapist 3629 14
## Musician 3644 7
## Nature conservation officer 727 16
## Naval architect 3639 24
## Network engineer 2188 24
## Nurse, children's 2914 16
## Nurse, mental health 1455 6
## Occupational hygienist 1457 8
## Occupational psychologist 2913 10
## Osteopath 3640 11
## Petroleum engineer 4376 7
## Pharmacist, hospital 729 7
## Physiotherapist 1454 9
## Pilot, airline 2907 17
## Planning and development surveyor 2185 7
## Podiatrist 2912 13
## Private music teacher 728 13
## Product designer 2913 11
## Product/process development scientist 1452 7
## Production manager 3649 9
## Public house manager 2908 8
## Public librarian 2185 11
## Public relations account executive 3629 19
## Radio broadcast assistant 729 9
## Radiographer, diagnostic 729 5
## Research officer, political party 4372 9
## Research scientist (maths) 2174 11
## Research scientist (medical) 0 8
## Research scientist (physical sciences) 2916 23
## Retail merchandiser 1456 11
## Sales executive 2184 11
## Sales professional, IT 2912 14
## Science writer 730 9
## Scientist, audiological 3638 21
## Scientist, marine 727 12
## Scientist, physiological 2177 12
## Scientist, research (maths) 2179 7
## Set designer 0 19
## Soil scientist 729 8
## Solicitor, Scotland 729 11
## Special educational needs teacher 4355 7
## Surveyor, land/geomatics 5831 24
## Surveyor, minerals 6564 25
## Surveyor, mining 2177 14
## Systems analyst 4370 28
## Systems developer 0 9
## Tax inspector 4371 8
## Teacher, adult education 727 10
## Teacher, early years/pre 728 11
## Teaching laboratory technician 727 3
## TEFL teacher 0 10
## Telecommunications researcher 2913 9
## Television floor manager 724 5
## Television/film/video producer 1456 14
## Therapist, art 3642 8
## Therapist, horticultural 728 10
## Therapist, music 1458 8
## Therapist, occupational 2907 15
## Tourist information centre manager 2912 8
## Town planner 2184 11
## Video editor 1454 12
## Water engineer 4373 2
## Web designer 1458 7
## Wellsite geologist 2182 20
We conclude the EDA with a statistical summary of the fraud data. EDA helps us understand the structure and underlying characteristics of the data.
# Overall Summary by is_fraud
isfraud_summary <- credit_data %>%
group_by(is_fraud) %>%
summarize (
mean_amt = mean(amt),
median_amt = median(amt),
max_amt = round(max(amt),2),
min_amt = min(amt),
total_transactions = n())
print(isfraud_summary)
## # A tibble: 2 × 6
## is_fraud mean_amt median_amt max_amt min_amt total_transactions
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 No 68.2 46.2 28949. 1 337825
## 2 Yes 518. 356. 1372. 1.78 1782
#Summarize the percentage of fraud per category
credit_data_prod_cat <- credit_data %>%
group_by(category) %>%
summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
mutate(perc_fraud = total_fraud * 100 / total_trans)
credit_data_prod_cat
## # A tibble: 14 × 4
## category total_fraud total_trans perc_fraud
## <chr> <int> <int> <dbl>
## 1 entertainment 55 24222 0.227
## 2 food_dining 38 23038 0.165
## 3 gas_transport 153 35089 0.436
## 4 grocery_net 27 11355 0.238
## 5 grocery_pos 433 32732 1.32
## 6 health_fitness 36 22593 0.159
## 7 home 50 32516 0.154
## 8 kids_pets 55 29704 0.185
## 9 misc_net 217 16898 1.28
## 10 misc_pos 62 20024 0.310
## 11 personal_care 55 24406 0.225
## 12 shopping_net 381 26379 1.44
## 13 shopping_pos 187 30329 0.617
## 14 travel 33 10322 0.320
hist(credit_data$city_pop) #Acceptable
In the data cleaning process, we first made sure that the amount and coordinate columns in ‘credit_data’ were in numeric format to enable numerical operations. Next, we converted ‘trans_date_trans_time’ to a date-time (POSIXct) and ‘dob’ to a Date so that date and time information is handled accurately in later analyses. These steps standardize the dataset and improve its reliability.
# convert to numeric format
credit_data$amt <- as.numeric(credit_data$amt)
credit_data$lat <- as.numeric(credit_data$lat)
credit_data$long <- as.numeric(credit_data$long)
credit_data$merch_lat <- as.numeric(credit_data$merch_lat)
credit_data$merch_long <- as.numeric(credit_data$merch_long)
# convert trans_date_trans_time & dob variables
credit_data$trans_date_trans_time <- as.POSIXct(credit_data$trans_date_trans_time)
credit_data$dob <- as.Date(credit_data$dob)
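The dataset does not state a time zone, so the conversion above falls back on the local zone of the machine running the script. A minimal sketch of a more explicit parse follows; treating the timestamps as UTC is an assumption made only for illustration.

```r
# Two timestamps copied from the dataset preview
raw_times <- c("2019-01-01 00:00:44", "2020-12-31 23:59:24")

# Parse with an explicit format and time zone (UTC is an assumption)
parsed <- as.POSIXct(raw_times, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Derive the hour of day and the ISO weekday number (Monday = 1)
trans_hour    <- as.integer(format(parsed, "%H"))
trans_weekday <- as.integer(format(parsed, "%u"))

print(data.frame(raw_times, trans_hour, trans_weekday))
```

Fixing the format and zone keeps derived features such as the transaction hour identical across machines.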
ggplot(credit_data_prod_cat, aes(reorder(category, perc_fraud), perc_fraud, fill = category)) +
geom_col() +
xlab("Product Category") +
ylab("Percentage of fraudulent transactions") +
labs(title = "Fraudulent transactions by category") +
theme(legend.position = "none", plot.title = element_text(size = 18)) +
coord_flip()
library(tidyr)
# Re-import dataset for non-adjusted time column
credit_data_time <- read.csv("credit_card_fraud.csv")
credit_data_time <- credit_data_time %>%
separate(trans_date_trans_time, into = c("t_date","t_time"), sep = " ") %>%
separate(t_time, into = c("t_hour","t_min", "t_sec"), sep = ":")
ggplot(credit_data_time, aes(x = t_hour)) + geom_bar() + labs(title = "Transactions by hour", x = "Transaction Hour", y = "Number of Transaction")
credit_data_time_filtered <- credit_data_time %>%
filter(is_fraud == 1)
ggplot(credit_data_time_filtered, aes(x = t_hour)) + geom_bar() + labs(title = "Fraudulent Transactions by hour", x = "Transaction Hour", y = "Number of Fraudulent Transactions")
From the two graphs above, transactions are evenly distributed across the AM (0000 - 1100) and PM (1300 - 2300) hours and are quite regular, without much volatility. However, fraudulent transactions spike between 2200 and 0300, a window that covers 1530 of the 1782 fraudulent observations, or over 85%.
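The share of fraud in the late-night window can be checked directly rather than read off the chart. The sketch below uses a made-up vector of hour labels (`fraud_hours` is hypothetical) in the same two-character string form as `t_hour`, and then repeats the arithmetic for the figures quoted above.

```r
# Hypothetical hour labels for ten fraudulent transactions
fraud_hours <- c("22", "23", "00", "01", "02", "03", "23", "14", "09", "22")

# The window wraps around midnight, so its hours are listed explicitly
night_window <- c("22", "23", "00", "01", "02", "03")

night_count <- sum(fraud_hours %in% night_window)
night_share <- 100 * night_count / length(fraud_hours)
print(night_share)

# The same arithmetic with the counts reported in the text
reported_share <- 100 * 1530 / 1782
print(round(reported_share, 2))
```

Applied to `credit_data_time_filtered$t_hour`, the first calculation should reproduce the reported count if the window is defined the same way.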
credit_data_time$dob <- as.Date(credit_data_time$dob)
credit_data_time$t_date <- as.Date(credit_data_time$t_date)
credit_data_time$age <- as.integer(difftime(credit_data_time$t_date, credit_data_time$dob, units = "days") / 365)
credit_data_time <- credit_data_time %>%
separate(t_date, into = c("t_year", "t_month", "t_day"), sep = "-",
remove = FALSE, convert = TRUE)
ggplot(credit_data_time, aes(x = t_day)) + geom_bar() + labs(title = "Transactions by days of month", x = "Days of the Month", y = "Number of transactions")
credit_data_time_filtered2 <- credit_data_time %>%
filter(is_fraud == 1)
ggplot(credit_data_time_filtered2, aes(x = t_day)) + geom_bar() + labs(title = "Fraudulent Transactions by days of month", x = "Days of the Month", y = "Number of Fraudulent Transactions")
From the graphs by day of the month, we can see a few peaks that could help identify dates with many fraudulent transactions: the start of the month, the middle of the month, and the last day of the month. However, these peaks are not pronounced enough to draw a clear conclusion.
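Counts by day of the month mix two effects: how much fraud occurs and how many transactions fall on that day (days 29 to 31 do not occur in every month). One way to separate them is to plot a fraud rate per day instead. The sketch below shows the calculation on a tiny made-up table (`toy_days` is hypothetical).

```r
# Hypothetical transactions: day of month and a 0/1 fraud flag
toy_days <- data.frame(
  t_day    = c(1, 1, 1, 1, 15, 15, 31, 31),
  is_fraud = c(1, 0, 0, 0, 1, 0, 1, 0)
)

# The mean of a 0/1 flag is the share of fraudulent transactions on that day
day_rates <- aggregate(is_fraud ~ t_day, data = toy_days, FUN = mean)
names(day_rates)[2] <- "fraud_rate"
print(day_rates)
```

In this toy table every day has exactly one fraud, yet the rate on day 1 is half that of days 15 and 31, a difference that a count chart would not show.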
ggplot(credit_data_time, aes(x = age)) + geom_bar() + labs(title = "Transactions by Age", x = "Age", y = "Number of transactions")
credit_data_time_filtered3 <- credit_data_time %>%
filter(is_fraud == 1)
ggplot(credit_data_time_filtered3, aes(x = age)) + geom_bar() + labs(title = "Fraudulent Transactions by Age", x = "Age", y = "Number of Fraudulent Transactions")
From the two graphs above, there is little difference between the age distribution of all transactions and that of the fraudulent ones, which suggests that the factors that could contribute to our prediction model lie elsewhere.
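The age used in these plots comes from dividing a day count by 365, which ignores leap days and can overstate age by a year close to a birthday. The sketch below compares that shortcut with a calendar-aware calculation from lubridate, using a made-up date of birth and transaction date.

```r
library(lubridate)

# Hypothetical dates chosen to sit just before a birthday
dob        <- as.Date("2000-01-01")
trans_date <- as.Date("2019-12-28")

# Shortcut: elapsed days divided by 365, truncated to an integer
age_approx <- as.integer(difftime(trans_date, dob, units = "days") / 365)

# Calendar-aware: completed years between the two dates
age_exact <- floor(time_length(interval(dob, trans_date), "years"))

print(c(approx = age_approx, exact = age_exact))
```

The difference is at most one year per person, so it is unlikely to change the shape of the plots, but the calendar-aware version is the safer choice if age is used as a model feature.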
We log-transform the amount axis for the plot because of the outliers and the skewed distribution typical of monetary variables.
ggplot(credit_data, aes(x = amt, y = factor(is_fraud))) +
geom_boxplot() +
labs(title = "Transaction amount by fraud status (Log transformed)") +
xlab("Amount") +
ylab("Fraud status") +
scale_y_discrete(labels = c("Not fraud", "Fraud")) +
scale_x_log10() +
theme(plot.title = element_text(size = 18)) +
coord_flip()
wilcox.test(amt ~ factor(is_fraud), data = credit_data) #p-value is less than 0.001.
##
## Wilcoxon rank sum test with continuity correction
##
## data: amt by factor(is_fraud)
## W = 102098032, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
#We reject the null hypothesis and conclude that there is evidence that transaction amounts differ between fraudulent and legitimate transactions
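With more than 300,000 observations almost any difference yields a tiny p-value, so an effect size helps show how large the shift is. As a rough sketch, dividing the W statistic by the number of possible pairs estimates the probability that a randomly chosen legitimate amount exceeds a randomly chosen fraudulent one, with ties counted as half.

```r
# Values copied from the outputs above
w_stat  <- 102098032  # W reported for the first group, "No"
n_legit <- 337825     # number of "No" transactions
n_fraud <- 1782       # number of "Yes" transactions

# W / (n1 * n2): estimated P(legitimate amount > fraudulent amount)
prob_legit_larger <- w_stat / (n_legit * n_fraud)
print(round(prob_legit_larger, 3))
```

A value well below 0.5 indicates that fraudulent amounts tend to be larger, which agrees with the medians in the summary table.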
credit_data_merchant <- credit_data %>%
group_by(merchant) %>%
summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
mutate(perc_fraud = total_fraud * 100 / total_trans) %>%
top_n(15, perc_fraud)
credit_data_merchant #Top 15 merchants by percentage of fraudulent transactions, ranging from 2.06 to 3.45%.
## # A tibble: 15 × 4
## merchant total_fraud total_trans perc_fraud
## <chr> <int> <int> <dbl>
## 1 Gottlieb, Considine and Schultz 12 566 2.12
## 2 Kerluke Inc 7 326 2.15
## 3 Kerluke-Abshire 17 515 3.30
## 4 Kiehn-Emmerich 19 658 2.89
## 5 Kunze Inc 16 645 2.48
## 6 Kutch and Sons 13 632 2.06
## 7 Lebsack and Sons 8 358 2.23
## 8 Moore, Dibbert and Koepp 8 337 2.37
## 9 Rempel Inc 11 523 2.10
## 10 Romaguera, Cruickshank and Greenholt 18 521 3.45
## 11 Schultz, Simonis and Little 13 622 2.09
## 12 Strosin-Cruickshank 14 671 2.09
## 13 Terry-Huel 13 519 2.50
## 14 Tillman, Fritsch and Schmitt 9 364 2.47
## 15 Welch Inc 8 342 2.34
#Plot the fraud percentage for the top 15 merchants
ggplot(credit_data_merchant, aes(reorder(merchant, perc_fraud), perc_fraud, fill = merchant)) +
geom_col() +
xlab("Merchant name") +
ylab("Percentage of fraudulent transactions") +
labs(title = "Fraudulent transactions by merchant name") +
theme(legend.position = "none",
plot.title = element_text(size = 12)) +
coord_flip()
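The ranking above relies on `top_n()`, which dplyr has superseded with `slice_max()`, and it applies no minimum transaction count, so a merchant with very few transactions could dominate. The sketch below shows one possible variant on a made-up table; the merchant names and the threshold of 100 are assumptions for illustration.

```r
library(dplyr)

# Hypothetical merchant totals
toy_merchants <- data.frame(
  merchant    = c("A", "B", "C", "D"),
  total_fraud = c(1, 5, 6, 0),
  total_trans = c(4, 500, 300, 250)
)

ranked <- toy_merchants %>%
  mutate(perc_fraud = total_fraud * 100 / total_trans) %>%
  filter(total_trans >= 100) %>%   # assumed minimum-volume threshold
  slice_max(perc_fraud, n = 2)     # rows come back sorted, highest first

print(ranked)
```

Merchant A has the highest rate in the toy data (25%) but only four transactions, so the volume filter removes it before ranking.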
credit_data_city <- credit_data %>%
group_by(city) %>%
summarize(total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
filter(total_trans >= 100) %>%
mutate(perc_fraud = total_fraud * 100 / total_trans) %>%
top_n(15, perc_fraud)
credit_data_city #Top 15 cities (with at least 100 transactions) by percentage of fraudulent transactions, ranging from 1.49 to 3.07%
## # A tibble: 15 × 4
## city total_fraud total_trans perc_fraud
## <chr> <int> <int> <dbl>
## 1 Albuquerque 24 1479 1.62
## 2 Aurora 23 750 3.07
## 3 Azusa 11 739 1.49
## 4 Bay City 15 744 2.02
## 5 Brashear 13 741 1.75
## 6 Jordan Valley 11 737 1.49
## 7 Louisiana 11 739 1.49
## 8 Loving 11 737 1.49
## 9 Monitor 14 740 1.89
## 10 Owensville 13 741 1.75
## 11 Sprague 13 743 1.75
## 12 Valentine 12 741 1.62
## 13 Westfir 12 741 1.62
## 14 Williamsburg 13 742 1.75
## 15 Yellowstone National Park 12 742 1.62
ggplot(credit_data_city, aes(reorder(city, perc_fraud), perc_fraud, fill = city)) +
geom_col() +
xlab("City name") +
ylab("Percentage of fraudulent transactions") +
labs(title = "Fraudulent transactions by city name") +
theme(legend.position = "none") +
coord_flip()
install.packages('lubridate')
library(lubridate)
# Extract the trans_date from trans_date_trans_time
credit_data_age <- credit_data %>%
mutate(trans_date = as.Date(trans_date_trans_time)) %>%
# Then calculate age in years
mutate(age = floor(time_length(interval(dob, trans_date), "years"))) %>%
# Deselect the unneeded date columns
select(-trans_date_trans_time, -trans_date)
head(credit_data_age)
## merchant category amt city state
## 1 Heller, Gutmann and Zieme grocery_pos 107.23 Orient WA
## 2 Lind-Buckridge entertainment 220.11 Malad City ID
## 3 Kiehn Inc grocery_pos 96.29 Grenada CA
## 4 Beier-Hyatt shopping_pos 7.77 High Rolls Mountain Park NM
## 5 Bruen-Yost misc_pos 6.85 Freedom WY
## 6 Kunze Inc grocery_pos 90.22 Honokaa HI
## lat long city_pop job dob
## 1 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21
## 2 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19
## 3 41.6125 -122.5258 589 Systems analyst 1945-12-21
## 4 32.9396 -105.8189 899 Naval architect 1967-08-30
## 5 43.0172 -111.0292 471 Education officer, museum 1967-08-02
## 6 20.0827 -155.4880 4878 Physiotherapist 1966-12-03
## trans_num merch_lat merch_long is_fraud age
## 1 1f76529f8574734946361c461b024d99 49.15905 -118.1865 No 40
## 2 a1a22d70485983eac12b5b88dad1cf95 43.15070 -112.1545 No 56
## 3 413636e759663f264aae1819a4d4f231 41.65752 -122.2303 No 73
## 4 8a6293af5ed278dea14448ded2685fea 32.86326 -106.5202 No 51
## 5 f3c43d336e92a44fc2fb67058d5949e3 43.75373 -111.4549 No 51
## 6 95826e3caa9e0b905294c6dae985aec1 19.56001 -156.0459 No 52
#Box plot of age by fraud status
ggplot(credit_data_age, aes(x = age, y = factor(is_fraud))) +
geom_boxplot() +
labs(title = "Age by fraud status") +
xlab("Age") +
ylab("Fraud status") +
scale_y_discrete(labels = c("Not fraud", "Fraud")) +
coord_flip()
# Split age into two bands: 50 and over ("Over 50") vs. under 50
credit_data_age <- credit_data_age %>%
mutate(age_cat = ifelse(age >= 50, "Over 50", "Under 50"))
head(credit_data_age)
## merchant category amt city state
## 1 Heller, Gutmann and Zieme grocery_pos 107.23 Orient WA
## 2 Lind-Buckridge entertainment 220.11 Malad City ID
## 3 Kiehn Inc grocery_pos 96.29 Grenada CA
## 4 Beier-Hyatt shopping_pos 7.77 High Rolls Mountain Park NM
## 5 Bruen-Yost misc_pos 6.85 Freedom WY
## 6 Kunze Inc grocery_pos 90.22 Honokaa HI
## lat long city_pop job dob
## 1 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21
## 2 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19
## 3 41.6125 -122.5258 589 Systems analyst 1945-12-21
## 4 32.9396 -105.8189 899 Naval architect 1967-08-30
## 5 43.0172 -111.0292 471 Education officer, museum 1967-08-02
## 6 20.0827 -155.4880 4878 Physiotherapist 1966-12-03
## trans_num merch_lat merch_long is_fraud age age_cat
## 1 1f76529f8574734946361c461b024d99 49.15905 -118.1865 No 40 Under 50
## 2 a1a22d70485983eac12b5b88dad1cf95 43.15070 -112.1545 No 56 Over 50
## 3 413636e759663f264aae1819a4d4f231 41.65752 -122.2303 No 73 Over 50
## 4 8a6293af5ed278dea14448ded2685fea 32.86326 -106.5202 No 51 Over 50
## 5 f3c43d336e92a44fc2fb67058d5949e3 43.75373 -111.4549 No 51 Over 50
## 6 95826e3caa9e0b905294c6dae985aec1 19.56001 -156.0459 No 52 Over 50
#Summarize the mean latitude, mean longitude of each state
#Then summarize the total fraud and total transactions
credit_data_state <- credit_data %>%
group_by(state) %>%
summarize(lat = mean(lat), long = mean(long), total_fraud = sum(is_fraud == "Yes"), total_trans = n()) %>%
#Finally calculate the percentage fraud
mutate(perc_fraud = total_fraud * 100 / total_trans)
credit_data_state #Highest percentage of fraud in Alaska
## # A tibble: 13 × 6
## state lat long total_fraud total_trans perc_fraud
## <chr> <dbl> <dbl> <int> <int> <dbl>
## 1 AK 65.0 -163. 50 2963 1.69
## 2 AZ 33.7 -112. 64 15362 0.417
## 3 CA 36.7 -120. 402 80495 0.499
## 4 CO 39.5 -105. 115 19766 0.582
## 5 HI 20.0 -155. 16 3649 0.438
## 6 ID 44.3 -116. 33 8035 0.411
## 7 MO 38.7 -92.7 262 54904 0.477
## 8 NE 41.2 -98.3 216 34425 0.627
## 9 NM 35.1 -106. 121 23427 0.516
## 10 OR 44.7 -122. 197 26408 0.746
## 11 UT 39.5 -111. 61 15357 0.397
## 12 WA 47.6 -120. 126 27040 0.466
## 13 WY 42.4 -108. 119 27776 0.428
#Because Alaska and Hawaii are geographically far from the other locations,
#it will be tricky to visualize those two states in the same map with the others.
#We can plot those states separately and apply them as inserts into the main map.
credit_data_prep <- credit_data_age %>%
select(category, amt, state, age_cat, is_fraud)
head(credit_data_prep)
## category amt state age_cat is_fraud
## 1 grocery_pos 107.23 WA Under 50 No
## 2 entertainment 220.11 ID Over 50 No
## 3 grocery_pos 96.29 CA Over 50 No
## 4 shopping_pos 7.77 NM Over 50 No
## 5 misc_pos 6.85 WY Over 50 No
## 6 grocery_pos 90.22 HI Over 50 No
The dataset is too large and the models run very slowly on it, so we randomly retain 25% of the observations.
credit_data_prep_reduced <- credit_data_prep %>%
slice_sample(prop = 0.25)
str(credit_data_prep_reduced)
## 'data.frame': 84901 obs. of 5 variables:
## $ category: chr "personal_care" "shopping_pos" "food_dining" "grocery_pos" ...
## $ amt : num 44.84 2.13 37.14 111.45 60.55 ...
## $ state : chr "CA" "OR" "MO" "CA" ...
## $ age_cat : chr "Over 50" "Over 50" "Over 50" "Over 50" ...
## $ is_fraud: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
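If no seed has been set earlier in the session, the 25% subsample will differ from run to run. The following is a minimal sketch of a repeatable variant, assuming a small toy data frame (`toy_data`, hypothetical values) in place of `credit_data_prep`:

```r
library(dplyr)

# Toy stand-in for credit_data_prep (hypothetical values, not the real data)
toy_data <- data.frame(
  amt = seq(1, 1000),
  is_fraud = factor(rep(c("No", "Yes"), times = c(980, 20)),
                    levels = c("No", "Yes"))
)

# Setting the seed immediately before sampling makes the subset repeatable
set.seed(44)
sample_a <- toy_data %>% slice_sample(prop = 0.25)

set.seed(44)
sample_b <- toy_data %>% slice_sample(prop = 0.25)

identical(sample_a, sample_b)  # both draws select the same rows
nrow(sample_a)                 # 25% of 1000 rows
```

The seed value itself is arbitrary; what matters is that it is set right before the random draw.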
set.seed(44)
rows <- nrow(credit_data_prep_reduced)
credit_data_prep_reduced_shuffled <- credit_data_prep_reduced[sample(rows), ]
split <- round(0.8 * rows)
train <- credit_data_prep_reduced_shuffled[1:split, ]
test <- credit_data_prep_reduced_shuffled[(split + 1):rows, ]
nrow(train)
## [1] 67921
nrow(test)
## [1] 16980
nrow(credit_data_prep_reduced_shuffled)
## [1] 84901
install.packages('caret')
Understanding the Decision Tree Structure
Transaction Amount: The model places significant emphasis on the transaction amount, suggesting that transactions of lower amounts are less likely to be fraudulent.
Category Importance: The categorygrocery_pos feature (the dummy variable created from the category column for grocery_pos transactions) is only examined for transactions with amt below 272, indicating it’s a significant predictor of fraud within lower-amount transactions.
Age Factor: For very high amounts (1134 and above), age becomes a relevant factor, with younger cardholders’ transactions being more likely to be classified as fraudulent.
Model Certainty: The leaf nodes show the model’s certainty about its predictions. For instance, transactions that are not categorized as ‘grocery_pos’ and have amt < 272 are very likely not to be fraudulent.
library(caret)
myControl <- trainControl(
method = "cv",
number = 2,
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE
)
str(train)
## 'data.frame': 67921 obs. of 5 variables:
## $ category: chr "gas_transport" "shopping_net" "kids_pets" "misc_pos" ...
## $ amt : num 45.02 78.48 42.55 1.89 6.68 ...
## $ state : chr "AZ" "NE" "OR" "AZ" ...
## $ age_cat : chr "Over 50" "Under 50" "Under 50" "Over 50" ...
## $ is_fraud: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
install.packages('rpart')
install.packages('rpart.plot')
set.seed(40)
rpart_model <- train(
is_fraud ~ .,
data = train,
method = "rpart",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold1: cp=0.03179
## - Fold1: cp=0.03179
## + Fold2: cp=0.03179
## - Fold2: cp=0.03179
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0318 on full training set
library(rpart.plot)
rpart.plot(rpart_model$finalModel, type = 2, fallen.leaves = TRUE)
rpart_model
## CART
##
## 67921 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 33961, 33960
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.03179191 0.8655146 0.9993045 0.43930636
## 0.04046243 0.8651605 0.9995265 0.33236994
## 0.06840077 0.6809043 0.9999852 0.07514451
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03179191.
test$is_fraud_rpart <- predict(rpart_model, test)
table(test$is_fraud, test$is_fraud_rpart)
##
## No Yes
## No 16869 20
## Yes 40 51
R Code and Model Training:
The train function from the caret package is used to train a Random Forest model on the training data to predict the is_fraud outcome. This function automates the process of model training including parameter tuning and cross-validation.
set.seed(40) ensures reproducibility of the results by setting the random number generator to a fixed point.
importance = “impurity” indicates that variable importance is assessed based on impurity measures which help in understanding which variables contribute most to the decision-making process.
trControl is a list of options that controls the training process, including cross-validation settings and the summary function used.
Plot and Results:
The plot illustrates the performance of the Random Forest model using two different splitting rules (gini and extratrees) across different numbers of randomly selected predictors at each split (mtry).
The performance metric used is the area under the ROC (Receiver Operating Characteristic) curve, which caret reports in the ROC column. The ROC curve itself plots the true positive rate (sensitivity) against the false positive rate (1-specificity) for various threshold settings.
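To make the ROC metric concrete, here is a minimal base-R sketch that computes the area under the ROC curve from predicted scores using the rank-sum (Mann-Whitney) identity. The function name `auc_from_scores` and the toy scores are hypothetical and are not part of the project code; caret computes this value internally.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity
auc_from_scores <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(scores)  # average ranks are used for tied scores
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted fraud probabilities and true labels
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c("No", "No", "Yes", "Yes")

auc_from_scores(toy_scores, toy_labels, positive = "Yes")  # 0.75
```

A value of 0.75 here means that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case 75% of the time.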
set.seed(40)
rf_model <- train(
is_fraud ~ .,
data = train,
method = "ranger",
importance = "impurity",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold1: mtry= 2, min.node.size=1, splitrule=gini
## - Fold1: mtry= 2, min.node.size=1, splitrule=gini
## + Fold1: mtry=14, min.node.size=1, splitrule=gini
## - Fold1: mtry=14, min.node.size=1, splitrule=gini
## + Fold1: mtry=27, min.node.size=1, splitrule=gini
## - Fold1: mtry=27, min.node.size=1, splitrule=gini
## + Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold1: mtry=14, min.node.size=1, splitrule=extratrees
## - Fold1: mtry=14, min.node.size=1, splitrule=extratrees
## + Fold1: mtry=27, min.node.size=1, splitrule=extratrees
## - Fold1: mtry=27, min.node.size=1, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=1, splitrule=gini
## - Fold2: mtry= 2, min.node.size=1, splitrule=gini
## + Fold2: mtry=14, min.node.size=1, splitrule=gini
## - Fold2: mtry=14, min.node.size=1, splitrule=gini
## + Fold2: mtry=27, min.node.size=1, splitrule=gini
## - Fold2: mtry=27, min.node.size=1, splitrule=gini
## + Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold2: mtry=14, min.node.size=1, splitrule=extratrees
## - Fold2: mtry=14, min.node.size=1, splitrule=extratrees
## + Fold2: mtry=27, min.node.size=1, splitrule=extratrees
## - Fold2: mtry=27, min.node.size=1, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27, splitrule = extratrees, min.node.size = 1 on full training set
plot(rf_model)
rf_model$results
## mtry min.node.size splitrule ROC Sens Spec ROCSD
## 1 2 1 gini 0.9104731 1.0000000 0.0000000 0.0008478298
## 2 2 1 extratrees 0.8958925 1.0000000 0.0000000 0.0015622869
## 3 14 1 gini 0.9371629 0.9989789 0.5260116 0.0102468095
## 4 14 1 extratrees 0.9258183 0.9995265 0.3872832 0.0101794843
## 5 27 1 gini 0.9231629 0.9984610 0.5404624 0.0063236860
## 6 27 1 extratrees 0.9386366 0.9983722 0.5231214 0.0004512699
## SensSD SpecSD
## 1 0.000000e+00 0.00000000
## 2 0.000000e+00 0.00000000
## 3 1.883739e-04 0.05722251
## 4 4.184621e-05 0.05722251
## 5 8.368003e-05 0.02043661
## 6 2.092465e-04 0.06130984
#check on the tests
test$is_fraud_rf <- predict(rf_model, test)
table(test$is_fraud, test$is_fraud_rf)
##
## No Yes
## No 16864 25
## Yes 35 56
Interpreting the Plot:
The x-axis shows the number of variables randomly sampled as candidates at each split (mtry), evaluated at three values: 2, 14, and 27.
The y-axis shows the ROC score obtained from cross-validation. ROC scores range from 0.5 (no better than random chance) to 1.0 (perfect classification).
There are two lines representing the two splitting rules used: gini and extratrees. The Gini impurity measure is a traditional choice that measures how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. extratrees is a variation of the usual Random Forest algorithm that makes the trees more random by also using random thresholds for each feature rather than the best split, among other changes.
The plot shows that the two splitting rules behave differently. With gini, the ROC rises from about 0.910 at mtry = 2 to 0.937 at mtry = 14 and then falls to 0.923 at mtry = 27, which suggests that an intermediate mtry balances the model’s bias and variance. With extratrees, the ROC keeps rising, from 0.896 to 0.926 to 0.939, so the largest value tried performs best.
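The Gini impurity described above can be written as 1 minus the sum of squared class proportions in a node. A minimal base-R sketch, using a hypothetical helper `gini_impurity` and toy labels:

```r
# Gini impurity of a node: 1 - sum of squared class proportions
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini_impurity(c("No", "No", "No", "Yes"))  # 0.375 (mixed node)
gini_impurity(c("No", "No", "No", "No"))   # 0 (pure node)
```

A split is attractive under the gini rule when it produces child nodes whose impurity is lower than that of the parent.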
Model Selection and Performance:
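The selected model uses mtry = 27 with the extratrees split rule and min.node.size = 1, which means it considers 27 candidate predictors at each split and draws split thresholds at random rather than searching for the best one.
The cross-validated ROC score for the selected model is approximately 0.939, indicating a strong predictive ability.
Because caret's twoClassSummary treats the first factor level ("No") as the positive class, Sens (about 0.998) is the share of legitimate transactions correctly classified, while Spec (about 0.523) is the share of fraudulent transactions caught. The model is therefore much better at identifying non-fraudulent transactions than fraudulent ones.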
Confusion Matrix:
The confusion matrix shows the number of correct and incorrect predictions:
No (Non-Fraud): 16864 correctly classified as non-fraud, 25 incorrectly flagged as fraud (false positives).
Yes (Fraud): 35 missed (false negatives), 56 correctly flagged as fraud (true positives).
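The counts in the table(test$is_fraud, test$is_fraud_rf) output can be turned into fraud-focused rates. The sketch below re-enters those counts by hand; the names `cm`, `fraud_recall`, `fraud_precision`, and `false_alarm_rate` are illustrative and not part of the project code.

```r
# Test-set counts copied from the Random Forest confusion matrix output
cm <- matrix(
  c(16864, 35, 25, 56),
  nrow = 2,
  dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes"))
)

# Share of actual fraud cases that the model flagged
fraud_recall <- cm["Yes", "Yes"] / sum(cm["Yes", ])

# Share of flagged transactions that were actually fraud
fraud_precision <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Share of legitimate transactions incorrectly flagged
false_alarm_rate <- cm["No", "Yes"] / sum(cm["No", ])

round(c(recall = fraud_recall,
        precision = fraud_precision,
        false_alarm = false_alarm_rate), 4)
```

On these counts the model catches 56 of 91 fraud cases (about 62%), and 56 of its 81 fraud flags are correct (about 69%).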
Summary:
The Random Forest model performs well overall, with a good ROC score and very few false alarms, although it catches only a little over half of the fraud cases (Spec of about 0.52 in cross-validation, and 56 of 91 on the test set).
The tuning of the mtry parameter shows that the best value depends on the splitting rule: a middle value was best for gini, while the largest value was best for extratrees, which gave the highest cross-validated ROC overall.
The number of false negatives (fraudulent transactions missed by the model) should be reduced as much as possible, considering the context of fraud detection.
How KNN Works:
set.seed(40)
knn_model <- train(
is_fraud ~ .,
data = train,
method = "knn",
trControl = myControl
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
## + Fold1: k=5
## - Fold1: k=5
## + Fold1: k=7
## - Fold1: k=7
## + Fold1: k=9
## - Fold1: k=9
## + Fold2: k=5
## - Fold2: k=5
## + Fold2: k=7
## - Fold2: k=7
## + Fold2: k=9
## - Fold2: k=9
## Aggregating results
## Selecting tuning parameters
## Fitting k = 9 on full training set
#KNN Results
knn_model$results
## k ROC Sens Spec ROCSD SensSD SpecSD
## 1 5 0.8411824 0.9988013 0.2225434 0.014134239 2.090297e-05 0.012261967
## 2 7 0.8563011 0.9988457 0.2196532 0.007908128 4.188027e-05 0.008174645
## 3 9 0.8657195 0.9989197 0.1994220 0.006256852 2.302313e-04 0.004087322
test$is_fraud_knn <- predict(knn_model, test)
table(test$is_fraud, test$is_fraud_knn)
##
## No Yes
## No 16868 21
## Yes 68 23
Interpretation of Warnings and Output:
The warning appears because train() defaults to the “Accuracy” metric, which twoClassSummary does not compute, so ROC (a measure of how well the model distinguishes between classes) was used as the performance metric instead.
The process includes cross-validation, which is denoted by “+ Fold1: k=5” and similar lines, indicating that the model was trained and evaluated with different ‘k’ values (5, 7, 9) on different subsets of the data (folds).
The model was finally fitted with k = 9 on the full training set because that value gave the highest ROC (about 0.866) during cross-validation.
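Confusion Matrix:
On the test set, 16868 legitimate transactions were correctly classified and 21 were incorrectly flagged as fraud, while 68 fraudulent transactions were missed and only 23 were caught.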
Data Frame Results:
Each row in the resulting data frame corresponds to a different ‘k’ value used during the training.
ROC is the average ROC score from cross-validation.
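Sens and Spec are the sensitivity and specificity, respectively. Because "No" is the first factor level, caret treats it as the positive class, so Sens is the share of legitimate transactions correctly classified and Spec is the share of fraudulent transactions caught.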
ROCSD, SensSD, and SpecSD are the standard deviations of the ROC, sensitivity, and specificity scores across the cross-validation folds.
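With only two folds, each mean and standard deviation is computed from just two values. As a worked check, the two per-fold ROC values for the final KNN model (the Min. and Max. reported for knn in the resamples summary) reproduce the ROC and ROCSD shown for k = 9, up to rounding of the printed inputs. The variable name `knn_fold_roc` is illustrative.

```r
# Per-fold ROC values for knn, copied from the resamples summary
knn_fold_roc <- c(0.8612953, 0.8701438)

mean(knn_fold_roc)  # matches the ROC column for k = 9 (about 0.8657)
sd(knn_fold_roc)    # matches ROCSD for k = 9 (about 0.00626)
```

Because two folds give a very rough estimate of variability, these standard deviations should be read with caution.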
Summary:
The KNN model with k = 9 has the highest ROC value, approximately 0.866, but its ability to catch fraud is poor: Spec, which is the fraud detection rate here because "No" is the positive class, is only about 0.20, and the test set shows 68 false negatives out of 91 fraud cases.
Spec decreases as ‘k’ increases (from about 0.223 at k = 5 to 0.199 at k = 9), which suggests that the model becomes more conservative with a larger ‘k’. The model with k = 9 is, therefore, less likely to predict fraud, perhaps due to the imbalanced nature of the dataset where non-fraudulent cases are much more common than fraudulent ones.
The low fraud detection rate is a significant issue in fraud detection because it means the model is missing a lot of fraudulent transactions. This would need to be addressed, possibly by using different features, resampling techniques to balance the classes, or a different threshold for classifying an observation as fraud.
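One of the options mentioned above, a different classification threshold, can be sketched as follows. With caret, class probabilities are available from predict(knn_model, test, type = "prob"); the sketch below instead uses hypothetical probabilities (`toy_prob_yes`) and a hypothetical helper (`classify_at_threshold`) so that it runs on its own.

```r
# Label a transaction as fraud when its predicted probability
# of "Yes" reaches the chosen threshold
classify_at_threshold <- function(prob_yes, threshold = 0.5) {
  factor(ifelse(prob_yes >= threshold, "Yes", "No"),
         levels = c("No", "Yes"))
}

# Hypothetical predicted fraud probabilities and true labels
toy_prob_yes <- c(0.05, 0.20, 0.35, 0.60, 0.90)
toy_actual <- factor(c("No", "No", "Yes", "Yes", "Yes"),
                     levels = c("No", "Yes"))

pred_default <- classify_at_threshold(toy_prob_yes, 0.5)
pred_lowered <- classify_at_threshold(toy_prob_yes, 0.3)

table(actual = toy_actual, predicted = pred_default)
table(actual = toy_actual, predicted = pred_lowered)
```

In this toy example, lowering the threshold from 0.5 to 0.3 catches the third fraud case. On real data the same change would typically also raise the number of false alarms, so the threshold should be chosen with both costs in mind.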
model_list <- list(
rpart = rpart_model,
rf = rf_model,
knn = knn_model
)
resamps <- resamples(model_list)
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: rpart, rf, knn
## Number of resamples: 2
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.8623982 0.8639564 0.8655146 0.8655146 0.8670728 0.8686309 0
## rf 0.9383175 0.9384771 0.9386366 0.9386366 0.9387962 0.9389557 0
## knn 0.8612953 0.8635074 0.8657195 0.8657195 0.8679317 0.8701438 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.9992601 0.9992823 0.9993045 0.9993045 0.9993267 0.9993489 0
## rf 0.9982242 0.9982982 0.9983722 0.9983722 0.9984462 0.9985201 0
## knn 0.9987569 0.9988383 0.9989197 0.9989197 0.9990011 0.9990825 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rpart 0.3641618 0.4017341 0.4393064 0.4393064 0.4768786 0.5144509 0
## rf 0.4797688 0.5014451 0.5231214 0.5231214 0.5447977 0.5664740 0
## knn 0.1965318 0.1979769 0.1994220 0.1994220 0.2008671 0.2023121 0
dotplot(resamps, metric = "Sens")
dotplot(resamps, metric = "Spec")
ROC (Receiver Operating Characteristic):
Random Forest has the highest ROC, about 0.939 in both folds. This indicates a superior ability to rank fraudulent transactions above non-fraudulent ones.
Decision Tree (rpart) has a ROC between about 0.862 and 0.869 across the two folds, well below the Random Forest.
K-Nearest Neighbors (knn) has ROC scores between about 0.861 and 0.870, essentially tied with the Decision Tree, so both are less capable of distinguishing between the two classes than the Random Forest.
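Sensitivity (True Positive Rate):
All three models have a sensitivity above 0.998 (rpart 0.9993, knn 0.9989, rf 0.9984). Because "No" is treated as the positive class in this summary, this means every model classifies nearly all legitimate transactions correctly.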
Specificity (True Negative Rate):
There is a more pronounced difference in specificity, which here measures the share of fraudulent transactions caught, since "No" is treated as the positive class. Random Forest has the highest specificity (about 0.52), indicating it is best at catching fraud.
Decision Tree has moderate specificity (about 0.44), meaning it misses more fraudulent transactions than the Random Forest model.
K-Nearest Neighbors has very low specificity (about 0.20). It misses roughly four in five fraudulent transactions, which would leave most fraud undetected.
Best Model Selection:
The Random Forest model is chosen as the best model because it has the highest ROC and Specificity. Its Sensitivity is marginally the lowest of the three (0.9984) but remains very high.
The preferred model is one that balances catching fraud with not disturbing genuine transactions. The Random Forest model provides this balance better than the other models tested.
Implementing the Random Forest model in a live system could reduce manual review workload, since it raises few false alarms, and enhance customer trust through accurately flagged transactions.
The Random Forest model’s robustness to different types of data and its ability to handle a large number of input variables make it a versatile and reliable choice for fraud detection.
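One way to make the balance between Sens and Spec concrete is balanced accuracy, the average of the two. This metric is not reported by the project; the sketch below is only an illustration that re-enters the cross-validated means from the summary(resamps) output into a hypothetical data frame `cv_means`.

```r
# Cross-validated mean Sens and Spec copied from summary(resamps)
cv_means <- data.frame(
  model = c("rpart", "rf", "knn"),
  sens = c(0.9993045, 0.9983722, 0.9989197),
  spec = c(0.4393064, 0.5231214, 0.1994220)
)

# Balanced accuracy weights both classes equally, which is useful
# when one class (fraud) is much rarer than the other
cv_means$balanced_accuracy <- (cv_means$sens + cv_means$spec) / 2

cv_means[order(-cv_means$balanced_accuracy), ]
```

On these figures the Random Forest ranks first (about 0.76), followed by the Decision Tree (about 0.72) and KNN (about 0.60), which is consistent with the model choice above.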
Credit card fraud detection using machine learning, hosted on Streamlit, offers an interactive and user-friendly platform for identifying and preventing fraudulent credit card transactions. This tool leverages advanced machine learning algorithms to analyze transaction data in real-time, pinpointing patterns that indicate fraud. Hosted on Streamlit, it provides an intuitive web interface where users can interact with the model.
Demo Data Product : Streamlit - Credit Card Prediction
Overall, every part of the project carries significant weight and contributes towards the final outcome, from data pre-processing and cleaning to visualisation, modelling, and finally the data product. Using R, its packages, and machine learning models, our visuals and models have answered the questions set out at the beginning of the project regarding credit card fraud prediction. Through consistent refinement and re-evaluation of the process and the models, a more accurate prediction model could be achieved in the future.