Credit card fraud detection has emerged as a critical challenge in modern financial transactions due to the increasing prevalence of online transactions and digital payment methods. The unauthorized use of credit cards for fraudulent activities poses substantial financial risks to individuals, businesses, and financial institutions. To combat this issue, sophisticated fraud detection systems are required to swiftly identify and prevent fraudulent transactions, ensuring the security and trustworthiness of financial operations.
This project aims to develop a credit card fraud detection system using machine learning algorithms and advanced analytics techniques. The system will analyze transaction data in real-time to identify potentially fraudulent activities and alert users or financial institutions to take appropriate actions.
This project is suitable for financial institutions such as banks, credit card companies, and payment processors that want to enhance their fraud detection capabilities and protect customers from fraudulent transactions.
The primary users of the credit card fraud detection system include fraud analysts, risk managers, and security teams within financial organizations. Additionally, customers who use credit cards for online or in-person transactions will indirectly benefit from the increased security provided by the system.
By implementing an effective fraud detection system, organizations can reduce financial losses due to fraudulent transactions, minimize reputational damage, and enhance customer trust and loyalty. Additionally, the system can help improve regulatory compliance and mitigate legal risks associated with fraud-related incidents.
Which machine learning models are better to classify the credit card fraud cases?
Are older customers more likely to participate in credit card fraud compared to younger customers?
To identify patterns and anomalies in credit card transaction data that may indicate fraudulent activity.
To compare which machine learning algorithms capable of accurately predicting and detecting fraudulent transactions.
To evaluate the performance of the fraud detection system, including metrics such as accuracy, precision, recall, F1 score, and receiver operating characteristic (ROC) curve analysis. Additionally, the system should be assessed for scalability, computational efficiency, and real-world usability.
| Variable | Descrtiption | Data type |
|---|---|---|
| X | Unique identifier | integer |
| trans_date_trans_time | Transaction date and time | datetime |
| cc_num | Credit card number of customers | integer |
| merchant | Name of merchant | categorical |
| category | categorical | |
| amt | Amount | float |
| first | first name | categorical |
| last | last name | categorical |
| gender | categorical | |
| street | street name | categorical |
| city | city name | categorical |
| state | state name | categorical |
| zip | Zip code | categorical |
| lat | Latitude | float |
| long | Longitude | float |
| city_pop | City population | integer |
| job | Job title | categorical |
| trans_num | categorical | |
| unix_time | datetime | |
| merch_lat | Merchant latitude | float |
| merch_long | Merchant longitude | float |
| is_fraud | Target variable: fraud status (0=Not fraud, 1=Fraud) | categorical |
- New variable(s) created
| Variable | Descrtiption | Data type |
|---|---|---|
| age | Age | integer |
- Variables to be removed
| Variable | Descrtiption | Data type | Reason |
|---|---|---|---|
| dob | Date of birth | datetime | New ‘age’ column created from ‘dob’ column |
# import necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(caret)
## Loading required package: lattice
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# quite slow to load because the files are too big # take around 5 minutes, once loaded, you don't need to re-run this code block
fraudTrain_df_raw <- read.csv("./fraudTrain.csv")
fraudTest_df_raw <- read.csv("./fraudTest.csv")
fraudTrain_df <- data.frame(fraudTrain_df_raw)
fraudTest_df <- data.frame(fraudTest_df_raw)
# combine the train and test data
combined_fraud_df_raw <- bind_rows(fraudTrain_df, fraudTest_df)
head(combined_fraud_df_raw, 10)
## X trans_date_trans_time cc_num merchant
## 1 0 2019-01-01 00:00:18 2.703186e+15 fraud_Rippin, Kub and Mann
## 2 1 2019-01-01 00:00:44 6.304233e+11 fraud_Heller, Gutmann and Zieme
## 3 2 2019-01-01 00:00:51 3.885949e+13 fraud_Lind-Buckridge
## 4 3 2019-01-01 00:01:16 3.534094e+15 fraud_Kutch, Hermiston and Farrell
## 5 4 2019-01-01 00:03:06 3.755342e+14 fraud_Keeling-Crist
## 6 5 2019-01-01 00:04:08 4.767265e+15 fraud_Stroman, Hudson and Erdman
## 7 6 2019-01-01 00:04:42 3.007469e+13 fraud_Rowe-Vandervort
## 8 7 2019-01-01 00:05:08 6.011361e+15 fraud_Corwin-Collins
## 9 8 2019-01-01 00:05:18 4.922711e+15 fraud_Herzog Ltd
## 10 9 2019-01-01 00:06:01 2.720830e+15 fraud_Schoen, Kuphal and Nitzsche
## category amt first last gender
## 1 misc_net 4.97 Jennifer Banks F
## 2 grocery_pos 107.23 Stephanie Gill F
## 3 entertainment 220.11 Edward Sanchez M
## 4 gas_transport 45.00 Jeremy White M
## 5 misc_pos 41.96 Tyler Garcia M
## 6 gas_transport 94.63 Jennifer Conner F
## 7 grocery_net 44.54 Kelsey Richards F
## 8 gas_transport 71.65 Steven Williams M
## 9 misc_pos 4.27 Heather Chase F
## 10 grocery_pos 198.39 Melissa Aguilar F
## street city state zip lat long
## 1 561 Perry Cove Moravian Falls NC 28654 36.0788 -81.1781
## 2 43039 Riley Greens Suite 393 Orient WA 99160 48.8878 -118.2105
## 3 594 White Dale Suite 530 Malad City ID 83252 42.1808 -112.2620
## 4 9443 Cynthia Court Apt. 038 Boulder MT 59632 46.2306 -112.1138
## 5 408 Bradley Rest Doe Hill VA 24433 38.4207 -79.4629
## 6 4655 David Island Dublin PA 18917 40.3750 -75.2045
## 7 889 Sarah Station Suite 624 Holcomb KS 67851 37.9931 -100.9893
## 8 231 Flores Pass Suite 720 Edinburg VA 22824 38.8432 -78.6003
## 9 6888 Hicks Stream Suite 954 Manor PA 15665 40.3359 -79.6607
## 10 21326 Taylor Squares Suite 708 Clarksville TN 37040 36.5220 -87.3490
## city_pop job dob
## 1 3495 Psychologist, counselling 1988-03-09
## 2 149 Special educational needs teacher 1978-06-21
## 3 4154 Nature conservation officer 1962-01-19
## 4 1939 Patent attorney 1967-01-12
## 5 99 Dance movement psychotherapist 1986-03-28
## 6 2158 Transport planner 1961-06-19
## 7 2691 Arboriculturist 1993-08-16
## 8 6018 Designer, multimedia 1947-08-21
## 9 1472 Public affairs consultant 1941-03-07
## 10 151785 Pathologist 1974-03-28
## trans_num unix_time merch_lat merch_long is_fraud
## 1 0b242abb623afc578575680df30655b9 1325376018 36.01129 -82.04832 0
## 2 1f76529f8574734946361c461b024d99 1325376044 49.15905 -118.18646 0
## 3 a1a22d70485983eac12b5b88dad1cf95 1325376051 43.15070 -112.15448 0
## 4 6b849c168bdad6f867558c3793159a81 1325376076 47.03433 -112.56107 0
## 5 a41d7549acf90789359a9aa5346dcb46 1325376186 38.67500 -78.63246 0
## 6 189a841a0a8ba03058526bcfe566aab5 1325376248 40.65338 -76.15267 0
## 7 83ec1cc84142af6e2acf10c44949e720 1325376282 37.16270 -100.15337 0
## 8 6d294ed2cc447d2c71c7171a3d54967c 1325376308 38.94809 -78.54030 0
## 9 fc28024ce480f8ef21a32d64c93a29f5 1325376318 40.35181 -79.95815 0
## 10 3b9014ea8fb80bd65de0b1463b00b00e 1325376361 37.17920 -87.48538 0
# save another dataframe copy so that we need not to re-download the file with the script above when we messed up with the dataset
# remove the first column: 'X', which is just the numbering
fraud_df <- combined_fraud_df_raw[-1,]
glimpse(fraud_df)
## Rows: 593,664
## Columns: 23
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trans_date_trans_time <chr> "2019-01-01 00:00:44", "2019-01-01 00:00:51", "2…
## $ cc_num <dbl> 6.304233e+11, 3.885949e+13, 3.534094e+15, 3.7553…
## $ merchant <chr> "fraud_Heller, Gutmann and Zieme", "fraud_Lind-B…
## $ category <chr> "grocery_pos", "entertainment", "gas_transport",…
## $ amt <dbl> 107.23, 220.11, 45.00, 41.96, 94.63, 44.54, 71.6…
## $ first <chr> "Stephanie", "Edward", "Jeremy", "Tyler", "Jenni…
## $ last <chr> "Gill", "Sanchez", "White", "Garcia", "Conner", …
## $ gender <chr> "F", "M", "M", "M", "F", "F", "M", "F", "F", "M"…
## $ street <chr> "43039 Riley Greens Suite 393", "594 White Dale …
## $ city <chr> "Orient", "Malad City", "Boulder", "Doe Hill", "…
## $ state <chr> "WA", "ID", "MT", "VA", "PA", "KS", "VA", "PA", …
## $ zip <int> 99160, 83252, 59632, 24433, 18917, 67851, 22824,…
## $ lat <dbl> 48.8878, 42.1808, 46.2306, 38.4207, 40.3750, 37.…
## $ long <dbl> -118.2105, -112.2620, -112.1138, -79.4629, -75.2…
## $ city_pop <int> 149, 4154, 1939, 99, 2158, 2691, 6018, 1472, 151…
## $ job <chr> "Special educational needs teacher", "Nature con…
## $ dob <chr> "1978-06-21", "1962-01-19", "1967-01-12", "1986-…
## $ trans_num <chr> "1f76529f8574734946361c461b024d99", "a1a22d70485…
## $ unix_time <int> 1325376044, 1325376051, 1325376076, 1325376186, …
## $ merch_lat <dbl> 49.15905, 43.15070, 47.03433, 38.67500, 40.65338…
## $ merch_long <dbl> -118.18646, -112.15448, -112.56107, -78.63246, -…
## $ is_fraud <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# some simple summary statistics of all of the variables in our fraud_df data frame
summary(fraud_df)
## X trans_date_trans_time cc_num merchant
## Min. : 0 Length:593664 Min. :6.042e+10 Length:593664
## 1st Qu.:110471 Class :character 1st Qu.:1.800e+14 Class :character
## Median :258887 Mode :character Median :3.521e+15 Mode :character
## Mean :261312 Mean :4.176e+17
## 3rd Qu.:407302 3rd Qu.:4.635e+15
## Max. :555718 Max. :4.992e+18
##
## category amt first last
## Length:593664 Min. : 1.00 Length:593664 Length:593664
## Class :character 1st Qu.: 9.63 Class :character Class :character
## Mode :character Median : 47.34 Mode :character Mode :character
## Mean : 69.56
## 3rd Qu.: 83.06
## Max. :22768.11
##
## gender street city state
## Length:593664 Length:593664 Length:593664 Length:593664
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## zip lat long city_pop
## Min. : 1257 Min. :20.03 Min. :-165.67 Min. : 23
## 1st Qu.:26237 1st Qu.:34.67 1st Qu.: -96.80 1st Qu.: 741
## Median :48174 Median :39.37 Median : -87.48 Median : 2408
## Mean :48835 Mean :38.54 Mean : -90.23 Mean : 88300
## 3rd Qu.:72011 3rd Qu.:41.89 3rd Qu.: -80.18 3rd Qu.: 19685
## Max. :99921 Max. :65.69 Max. : -67.95 Max. :2906700
##
## job dob trans_num unix_time
## Length:593664 Length:593664 Length:593664 Min. :1.325e+09
## Class :character Class :character Class :character 1st Qu.:1.375e+09
## Mode :character Mode :character Mode :character Median :1.380e+09
## Mean :1.377e+09
## 3rd Qu.:1.386e+09
## Max. :1.389e+09
## NA's :1
## merch_lat merch_long is_fraud
## Min. :19.03 Min. :-166.67 Min. :0.000000
## 1st Qu.:34.76 1st Qu.: -96.91 1st Qu.:0.000000
## Median :39.37 Median : -87.44 Median :0.000000
## Mean :38.54 Mean : -90.23 Mean :0.004245
## 3rd Qu.:41.95 3rd Qu.: -80.26 3rd Qu.:0.000000
## Max. :66.68 Max. : -66.95 Max. :1.000000
## NA's :1 NA's :1 NA's :1
# Basic statistics for all the variables
# get total count, unique count and missing value count of each columns
summary_table <- fraud_df %>% reframe(
name=as.vector(names(fraud_df)),
count=n(), # total count
unique=sapply(fraud_df, function (x){length(unique(x))}), # unique count
missing_value_count=sapply(fraud_df, function(x){sum(is.na(x))}) # check missing values
)
summary_table
## name count unique missing_value_count
## 1 X 593664 555719 0
## 2 trans_date_trans_time 593664 582252 0
## 3 cc_num 593664 925 0
## 4 merchant 593664 693 0
## 5 category 593664 14 0
## 6 amt 593664 38297 0
## 7 first 593664 341 0
## 8 last 593664 471 0
## 9 gender 593664 2 0
## 10 street 593664 925 0
## 11 city 593664 849 0
## 12 state 593664 50 0
## 13 zip 593664 913 0
## 14 lat 593664 911 0
## 15 long 593664 911 0
## 16 city_pop 593664 835 0
## 17 job 593664 478 0
## 18 dob 593664 911 0
## 19 trans_num 593664 593664 0
## 20 unix_time 593664 582252 1
## 21 merch_lat 593664 583142 1
## 22 merch_long 593664 589156 1
## 23 is_fraud 593664 3 1
# removing null values
if (any(is.na(fraud_df))) {
no_of_rows_with_missing_values <- sum(is.na(fraud_df))
fraud_df <- na.omit(fraud_df)
print(paste0(no_of_rows_with_missing_values,
"Rows with missing values are removed"))
} else {
print("There is no missing values")
}
## [1] "4Rows with missing values are removed"
# check if there is any duplicated rows in the fraud_df dataset
if (sum(duplicated(fraud_df)) > 0) {
fraud_df <- fraud_df[!duplicated(fraud_df), ]
} else {
print("No duplicated rows found!")
}
## [1] "No duplicated rows found!"
# assign correct data type to each variable in the data frame (e.g., factor, vector, list, etc)
fraud_df$cc_num <- as.factor(fraud_df$cc_num)
fraud_df$merchant <- as.factor(fraud_df$merchant)
fraud_df$category<- as.factor(fraud_df$category)
fraud_df$amt <- as.numeric(fraud_df$amt)
fraud_df$first <- as.factor(fraud_df$first)
fraud_df$last <- as.factor(fraud_df$last)
fraud_df$gender <- as.factor(fraud_df$gender)
fraud_df$street <- as.factor(fraud_df$street)
fraud_df$city <- as.factor(fraud_df$city)
fraud_df$state <- as.factor(fraud_df$state)
fraud_df$zip <- as.factor(fraud_df$zip)
fraud_df$lat <- as.numeric(fraud_df$lat)
fraud_df$long <- as.numeric(fraud_df$long)
fraud_df$city_pop <- as.numeric(fraud_df$city_pop)
fraud_df$job <- as.factor(fraud_df$job)
fraud_df$dob <- as.Date(fraud_df$dob)
fraud_df$trans_num <- as.factor(fraud_df$trans_num)
fraud_df$unix_time <- as.Date(as.POSIXct(fraud_df$unix_time, origin="1970-01-01")) # convert unix time to Date object
fraud_df$merch_lat <- as.numeric(fraud_df$merch_lat)
fraud_df$merch_long <- as.numeric(fraud_df$merch_long)
fraud_df$is_fraud <- as.factor(fraud_df$is_fraud)
# check back again the data type of each variables after data type conversion in the previous step
glimpse(fraud_df)
## Rows: 593,663
## Columns: 23
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trans_date_trans_time <chr> "2019-01-01 00:00:44", "2019-01-01 00:00:51", "2…
## $ cc_num <fct> 630423337322, 38859492057661, 3534093764340240, …
## $ merchant <fct> "fraud_Heller, Gutmann and Zieme", "fraud_Lind-B…
## $ category <fct> grocery_pos, entertainment, gas_transport, misc_…
## $ amt <dbl> 107.23, 220.11, 45.00, 41.96, 94.63, 44.54, 71.6…
## $ first <fct> Stephanie, Edward, Jeremy, Tyler, Jennifer, Kels…
## $ last <fct> Gill, Sanchez, White, Garcia, Conner, Richards, …
## $ gender <fct> F, M, M, M, F, F, M, F, F, M, F, M, M, M, M, F, …
## $ street <fct> 43039 Riley Greens Suite 393, 594 White Dale Sui…
## $ city <fct> Orient, Malad City, Boulder, Doe Hill, Dublin, H…
## $ state <fct> WA, ID, MT, VA, PA, KS, VA, PA, TN, IA, WV, FL, …
## $ zip <fct> 99160, 83252, 59632, 24433, 18917, 67851, 22824,…
## $ lat <dbl> 48.8878, 42.1808, 46.2306, 38.4207, 40.3750, 37.…
## $ long <dbl> -118.2105, -112.2620, -112.1138, -79.4629, -75.2…
## $ city_pop <dbl> 149, 4154, 1939, 99, 2158, 2691, 6018, 1472, 151…
## $ job <fct> "Special educational needs teacher", "Nature con…
## $ dob <date> 1978-06-21, 1962-01-19, 1967-01-12, 1986-03-28,…
## $ trans_num <fct> 1f76529f8574734946361c461b024d99, a1a22d70485983…
## $ unix_time <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01,…
## $ merch_lat <dbl> 49.15905, 43.15070, 47.03433, 38.67500, 40.65338…
## $ merch_long <dbl> -118.18646, -112.15448, -112.56107, -78.63246, -…
## $ is_fraud <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# create a new column called "age"
reference_date <- as.Date("2019-01-01")
fraud_df$age <- as.numeric(lubridate::interval(fraud_df$dob, reference_date) / years(1))
fraud_df <- subset(fraud_df, select=c(-dob))
# convert the data type of 'age' column to numeric
fraud_df$age <- as.numeric(fraud_df$age)
head(fraud_df,10)
## X trans_date_trans_time cc_num merchant
## 2 1 2019-01-01 00:00:44 630423337322 fraud_Heller, Gutmann and Zieme
## 3 2 2019-01-01 00:00:51 38859492057661 fraud_Lind-Buckridge
## 4 3 2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell
## 5 4 2019-01-01 00:03:06 375534208663984 fraud_Keeling-Crist
## 6 5 2019-01-01 00:04:08 4767265376804500 fraud_Stroman, Hudson and Erdman
## 7 6 2019-01-01 00:04:42 30074693890476 fraud_Rowe-Vandervort
## 8 7 2019-01-01 00:05:08 6011360759745864 fraud_Corwin-Collins
## 9 8 2019-01-01 00:05:18 4922710831011201 fraud_Herzog Ltd
## 10 9 2019-01-01 00:06:01 2720830304681674 fraud_Schoen, Kuphal and Nitzsche
## 11 10 2019-01-01 00:06:23 4642894980163 fraud_Rutherford-Mertz
## category amt first last gender
## 2 grocery_pos 107.23 Stephanie Gill F
## 3 entertainment 220.11 Edward Sanchez M
## 4 gas_transport 45.00 Jeremy White M
## 5 misc_pos 41.96 Tyler Garcia M
## 6 gas_transport 94.63 Jennifer Conner F
## 7 grocery_net 44.54 Kelsey Richards F
## 8 gas_transport 71.65 Steven Williams M
## 9 misc_pos 4.27 Heather Chase F
## 10 grocery_pos 198.39 Melissa Aguilar F
## 11 grocery_pos 24.74 Eddie Mendez M
## street city state zip lat long
## 2 43039 Riley Greens Suite 393 Orient WA 99160 48.8878 -118.2105
## 3 594 White Dale Suite 530 Malad City ID 83252 42.1808 -112.2620
## 4 9443 Cynthia Court Apt. 038 Boulder MT 59632 46.2306 -112.1138
## 5 408 Bradley Rest Doe Hill VA 24433 38.4207 -79.4629
## 6 4655 David Island Dublin PA 18917 40.3750 -75.2045
## 7 889 Sarah Station Suite 624 Holcomb KS 67851 37.9931 -100.9893
## 8 231 Flores Pass Suite 720 Edinburg VA 22824 38.8432 -78.6003
## 9 6888 Hicks Stream Suite 954 Manor PA 15665 40.3359 -79.6607
## 10 21326 Taylor Squares Suite 708 Clarksville TN 37040 36.5220 -87.3490
## 11 1831 Faith View Suite 653 Clarinda IA 51632 40.7491 -95.0380
## city_pop job trans_num
## 2 149 Special educational needs teacher 1f76529f8574734946361c461b024d99
## 3 4154 Nature conservation officer a1a22d70485983eac12b5b88dad1cf95
## 4 1939 Patent attorney 6b849c168bdad6f867558c3793159a81
## 5 99 Dance movement psychotherapist a41d7549acf90789359a9aa5346dcb46
## 6 2158 Transport planner 189a841a0a8ba03058526bcfe566aab5
## 7 2691 Arboriculturist 83ec1cc84142af6e2acf10c44949e720
## 8 6018 Designer, multimedia 6d294ed2cc447d2c71c7171a3d54967c
## 9 1472 Public affairs consultant fc28024ce480f8ef21a32d64c93a29f5
## 10 151785 Pathologist 3b9014ea8fb80bd65de0b1463b00b00e
## 11 7297 IT trainer d71c95ab6b7356dd74389d41df429c87
## unix_time merch_lat merch_long is_fraud age
## 2 2012-01-01 49.15905 -118.18646 0 40.53151
## 3 2012-01-01 43.15070 -112.15448 0 56.95068
## 4 2012-01-01 47.03433 -112.56107 0 51.96986
## 5 2012-01-01 38.67500 -78.63246 0 32.76438
## 6 2012-01-01 40.65338 -76.15267 0 57.53699
## 7 2012-01-01 37.16270 -100.15337 0 25.37808
## 8 2012-01-01 38.94809 -78.54030 0 71.36438
## 9 2012-01-01 40.35181 -79.95815 0 77.82192
## 10 2012-01-01 37.17920 -87.48538 0 44.76438
## 11 2012-01-01 40.27589 -96.01155 0 28.47123
genuine_vs_fraud_amt <- fraud_df %>%
mutate(is_fraud_label = ifelse(is_fraud == 0, "Genuine", "Fraud"))
# Plotting boxplot
boxplot(amt ~ is_fraud_label, data = genuine_vs_fraud_amt,
main = "Comparing Genuine vs Fraud average amounts")
fraud_category_count <- fraud_df %>% group_by(is_fraud, category
) %>% reframe(count=n()) %>% spread(is_fraud, count)
# rename the column variables
colnames(fraud_category_count) <- c("category", "category_0", "category_1")
fraud_category_count <- fraud_category_count %>% mutate(
total_count=category_0+category_1) %>% mutate(
fraud_percent=category_1/total_count*100) %>% arrange(desc(fraud_percent))
ggplot(fraud_category_count, aes(
x=reorder(category,-fraud_percent), y=fraud_percent)
) + geom_bar(stat='identity', width=0.5) + labs(
title="Percentage of fraud across various categories (%)",
x="Category",
y="Percentage of fraud (%)") + geom_text(
aes(label=round(fraud_percent, 2)), vjust=-0.5
) + theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adjusts the margins of the plot
legend.position = "none"
)
# Who are the top 10 cities that are most often affected by fraud cases
city_fraud_case_count <- fraud_df[which(fraud_df$is_fraud == "1"),
] %>% group_by(city) %>% summarise(count=n()) %>% arrange(desc(count))
ggplot(head(city_fraud_case_count,10), aes(x=reorder(city, -count), y=count)
) + geom_bar(stat='identity'
) + geom_text(aes(label=count), vjust=-0.5
) + theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adjusts the margins of the plot
legend.position = "none"
) + labs(
title = "Top 10 cities most often affected by fraud cases",
x = "City name",
y = "Count"
)
# Who are the top 10 merchants most often affected by fraud cases?
merchant_fraud_case_count <- fraud_df[which(fraud_df$is_fraud == "1"),
] %>% group_by(merchant) %>% summarise(count=n()) %>% arrange(desc(count))
ggplot(head(merchant_fraud_case_count,10), aes(x=reorder(merchant, -count), y=count)
) + geom_bar(stat='identity'
) + geom_text(aes(label=count), vjust=-0.5
) + theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adjusts the margins of the plot
legend.position = "none"
) + labs(
title = "Top 10 merchants most often affected by fraud cases",
x = "Merchant name",
y = "Count"
)
adjusted_fraud_df <- fraud_df %>% mutate(is_fraud_label =
ifelse(is_fraud == 0, "Genuine", "Fraud"))
ggplot(adjusted_fraud_df, aes(x = is_fraud_label,
y = age, fill = is_fraud_label)) +
geom_boxplot() +
labs(
title = "How does age differ by Fraud Status?",
x = "Fraud status",
y = "Age"
) +
theme_minimal() +
theme(
strip.background = element_blank(),
strip.text = element_text(size = 12, face = "bold"),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
) + scale_fill_manual(values = c("red", "cyan"))
Independent t-test
\[H_0: \mu_{\text{age_fraud}} =< \mu_{\text{age_non-fraud}}\] \[H_1: \mu_{\text{age_fraud}} > \mu_{\text{age_non-fraud}}\]
genuine_group_with_age <- fraud_df %>% filter(is_fraud==0) %>% select(age)
fraud_group_with_age <- fraud_df %>% filter(is_fraud==1) %>% select(age)
# independent t-test
res <- t.test(fraud_group_with_age, genuine_group_with_age,
alternative='greater', conf.level=0.95)
res
##
## Welch Two Sample t-test
##
## data: fraud_group_with_age and genuine_group_with_age
## t = 6.2496, df = 2539.1, p-value = 2.406e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.655274 Inf
## sample estimates:
## mean of x mean of y
## 47.38756 45.14071
if (res$p.value > 0.05){
print("The mean age of individuals in fraud cases is less than or equal to the mean age of individuals in non-fraud cases.")
} else {
genuine_group_with_age <- fraud_df %>% filter(is_fraud==0) %>% select(age)
print("The mean age of individuals in fraud cases is greater than the mean age of individuals in non-fraud cases.")
}
## [1] "The mean age of individuals in fraud cases is greater than the mean age of individuals in non-fraud cases."
Chi-Square Goodness of Test of Independence
Null hypothesis (H0): There is no association between age group and fraud status. In other words, the distribution of age groups is the same for fraudulent and genuine cases.
Alternative hypothesis (H1): There is an association between age group and fraud status. In other words, the distribution of age groups differs between fraudulent and genuine cases.
# Load required library
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
age_group <- c("0-30", "30-60", "60+")
adjusted_fraud_df <- fraud_df %>% mutate(is_fraud_label = ifelse(is_fraud == 0, "Genuine", "Fraud"))
adjusted_fraud_df$age_group <- cut(adjusted_fraud_df$age,
breaks=c(0, 30, 60, Inf),
labels=age_group,
right=TRUE
)
matrix_data <-adjusted_fraud_df %>% group_by(age_group, is_fraud_label
) %>% reframe(count=n()) %>% spread(is_fraud_label, count)
matrix_data <- as.table(rbind(matrix_data$Fraud, matrix_data$Genuine))
# Perform the Chi-Square Test
res <- chisq.test(matrix_data)
res
##
## Pearson's Chi-squared test
##
## data: matrix_data
## X-squared = 34.876, df = 2, p-value = 2.671e-08
if (res$p.value < 0.05){
print("CONCLUSION: There is a statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups differs between fraudulent and genuine cases.")
} else {
print("CONCLUSION: There is no statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups is similar for fraudulent and genuine cases.")
}
## [1] "CONCLUSION: There is a statistically significant association between age group and fraud status at the 0.05 level. The distribution of age groups differs between fraudulent and genuine cases."
# Correlation analysis: what features are correlated?
numerical_vars <- subset(fraud_df, select = c(amt, lat, long, city_pop, merch_lat, merch_long, age))
relation <- cor(numerical_vars, method="pearson")
corrplot(relation, method = 'number')
# Ans: merch_lat is strongly correlated with lat, and merch_long is strongly correlated with long
# we might remove any either of the correlated variable in our statistics and machine learning models
# Load necessary libraries for modelling
library(tibble)
library(dplyr)
library(caret)
library(stringr)
library(tidyr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ pastecs::extract() masks tidyr::extract()
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ pastecs::first() masks dplyr::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ pastecs::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ plotly::select() masks MASS::select(), dplyr::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:gridExtra':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library(e1071)
library(rsample)
##
## Attaching package: 'rsample'
##
## The following object is masked from 'package:e1071':
##
## permutations
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:plotly':
##
## slice
##
## The following object is masked from 'package:dplyr':
##
## slice
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(ROSE)
## Loaded ROSE 0.0-4
library(class)
# copy the fraud_df dataframe to be used in modelling
model_df <- fraud_df
# Detecting whether the data is balanced or not
class_proportion <- prop.table(table(model_df$is_fraud))
class_proportion
##
## 0 1
## 0.995755167 0.004244833
With ROSE library package, it applies: - Oversampling
for fraud category - Undersampling for non-fraud category
library(ROSE)
# mixed sampling
set.seed(123) # for reproducibility of results
balanced_data <- ovun.sample(is_fraud ~ ., data = model_df, method = "both", N = 10000, p = 0.5, seed = 1)$data
# view the balanced dataset
table(balanced_data$is_fraud)
##
## 0 1
## 5047 4953
# generate a map of the data distribution before processing
plot_original <- ggplot(model_df,aes(x = is_fraud)) +
geom_bar(fill = "skyblue") +
labs(title = "Original Data Distribution", x = "Is Fraud", y = "Count")
# generate a map of the processed data distribution
plot_balanced <- ggplot(balanced_data, aes(x = is_fraud)) +
geom_bar(fill = "lightgreen") +
labs(title = "Balanced Data Distribution", x = "Is Fraud", y = "Count")
# display the two images side by side
grid.arrange(plot_original, plot_balanced, ncol = 2)
# Read and preprocess data
balanced_model_df <- balanced_data %>%
mutate(
category = as.factor(category),
gender = as.factor(gender),
city = as.factor(city),
state = as.factor(state),
job = as.factor(job),
amt = as.integer(amt),
is_fraud = factor(is_fraud, levels = c(0, 1), labels = c("Not_Fraud", "Fraud"))
) %>%
select(-trans_date_trans_time)
# Print the frequency of the is_fraud column
print(table(balanced_model_df$is_fraud))
##
## Not_Fraud Fraud
## 5047 4953
# Set random seed
set.seed(123)
# Add row names as ID column to the dataframe
df_new <- rownames_to_column(balanced_model_df, var = "id") %>% mutate(id = as.integer(id))
# Stratified sampling to create training and testing sets
set.seed(123)
split <- initial_split(df_new, prop = 0.70, strata = "is_fraud")
training <- training(split) %>% as.data.frame()
testing <- testing(split) %>% as.data.frame()
# View the proportion of is_fraud column in training and testing sets
print(paste0("There are ", length(training$is_fraud), " rows of data in training set."))
## [1] "There are 6999 rows of data in training set."
print(paste0("There are ", length(testing$is_fraud), " rows of data in testing set."))
## [1] "There are 3001 rows of data in testing set."
# check if the training dataset is clean from missing values
summary_table <- training %>% reframe(
name=as.vector(names(training)),
count=n(), # total count
unique=sapply(training, function (x){length(unique(x))}), # unique count
missing_value_count=sapply(training, function(x){sum(is.na(x))}) # check missing values
)
summary_table
## name count unique missing_value_count
## 1 id 6999 6999 0
## 2 X 6999 5420 0
## 3 cc_num 6999 875 0
## 4 merchant 6999 690 0
## 5 category 6999 14 0
## 6 amt 6999 838 0
## 7 first 6999 335 0
## 8 last 6999 452 0
## 9 gender 6999 2 0
## 10 street 6999 875 0
## 11 city 6999 812 0
## 12 state 6999 50 0
## 13 zip 6999 865 0
## 14 lat 6999 863 0
## 15 long 6999 863 0
## 16 city_pop 6999 800 0
## 17 job 6999 464 0
## 18 trans_num 6999 5425 0
## 19 unix_time 6999 216 0
## 20 merch_lat 6999 5424 0
## 21 merch_long 6999 5425 0
## 22 is_fraud 6999 2 0
## 23 age 6999 861 0
# Create trainControl object
set.seed(123)
train_control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)
# Decision tree model fitting
tune_grid <- expand.grid(cp = seq(0.01, 0.1, by = 0.01))
model_dt <- train(
is_fraud ~ amt + gender + city_pop + category +
state + job,
data = training,
method = "rpart",
trControl = train_control,
tuneGrid = tune_grid,
metric = "ROC"
)
# Print the best parameters
print(model_dt$bestTune)
## cp
## 1 0.01
# Model prediction
pred_dt_class <- predict(model_dt, newdata = testing)
cm_dt<-confusionMatrix(pred_dt_class, testing$is_fraud,positive="Fraud")
cm_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not_Fraud Fraud
## Not_Fraud 1404 150
## Fraud 111 1336
##
## Accuracy : 0.913
## 95% CI : (0.9024, 0.9229)
## No Information Rate : 0.5048
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.826
##
## Mcnemar's Test P-Value : 0.01867
##
## Sensitivity : 0.8991
## Specificity : 0.9267
## Pos Pred Value : 0.9233
## Neg Pred Value : 0.9035
## Prevalence : 0.4952
## Detection Rate : 0.4452
## Detection Prevalence : 0.4822
## Balanced Accuracy : 0.9129
##
## 'Positive' Class : Fraud
##
# Random forest model fitting
set.seed(123)
model_rf <- randomForest(
is_fraud ~ amt + city_pop,
data = training,
ntree = 500,
mtry = 2
)
# Model prediction
pred_rf_class <- predict(model_rf, newdata = testing)
cm_rf<-confusionMatrix(pred_rf_class, testing$is_fraud,positive="Fraud")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not_Fraud Fraud
## Not_Fraud 1381 61
## Fraud 134 1425
##
## Accuracy : 0.935
## 95% CI : (0.9256, 0.9436)
## No Information Rate : 0.5048
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8701
##
## Mcnemar's Test P-Value : 2.522e-07
##
## Sensitivity : 0.9590
## Specificity : 0.9116
## Pos Pred Value : 0.9140
## Neg Pred Value : 0.9577
## Prevalence : 0.4952
## Detection Rate : 0.4748
## Detection Prevalence : 0.5195
## Balanced Accuracy : 0.9353
##
## 'Positive' Class : Fraud
##
# Install and load glmnet
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
# Prepare data for glmnet
x_train <- model.matrix(is_fraud ~ amt + gender + city_pop + category + state, training)[, -1]
y_train <- as.factor(training$is_fraud)
# Fit logistic regression with L1 regularization (lasso)
set.seed(123)
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)
# Prepare the test data
x_test <- model.matrix(is_fraud ~ amt + gender + city_pop + category + state, testing)[, -1]
# Predict on the test set
pred_lr_proba <- predict(cv_fit, newx = x_test, type = "response", s = "lambda.min")
pred_lr_class <- ifelse(pred_lr_proba > 0.5, "Fraud", "Not_Fraud")
pred_lr_class <- factor(pred_lr_class, levels = levels(testing$is_fraud))
cm_lr<-confusionMatrix(pred_lr_class, testing$is_fraud, positive="Fraud")
cm_lr
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not_Fraud Fraud
## Not_Fraud 1327 365
## Fraud 188 1121
##
## Accuracy : 0.8157
## 95% CI : (0.8014, 0.8295)
## No Information Rate : 0.5048
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.631
##
## Mcnemar's Test P-Value : 7.194e-14
##
## Sensitivity : 0.7544
## Specificity : 0.8759
## Pos Pred Value : 0.8564
## Neg Pred Value : 0.7843
## Prevalence : 0.4952
## Detection Rate : 0.3735
## Detection Prevalence : 0.4362
## Balanced Accuracy : 0.8151
##
## 'Positive' Class : Fraud
##
# KNN model fitting
set.seed(123)
knn_model <- train(
is_fraud ~ amt + gender + city_pop + category + state + job,
data = training,
method = "knn",
trControl = train_control,
metric = "ROC"
)
pred_knn_class <- predict(knn_model, newdata = testing, type="raw")
cm_knn<-confusionMatrix(pred_knn_class, testing$is_fraud,)
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not_Fraud Fraud
## Not_Fraud 1291 128
## Fraud 224 1358
##
## Accuracy : 0.8827
## 95% CI : (0.8707, 0.894)
## No Information Rate : 0.5048
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7655
##
## Mcnemar's Test P-Value : 4.116e-07
##
## Sensitivity : 0.8521
## Specificity : 0.9139
## Pos Pred Value : 0.9098
## Neg Pred Value : 0.8584
## Prevalence : 0.5048
## Detection Rate : 0.4302
## Detection Prevalence : 0.4728
## Balanced Accuracy : 0.8830
##
## 'Positive' Class : Not_Fraud
##
library(pROC)
# Convert factor to numeric
pred_dt_class <- as.numeric(pred_dt_class)
pred_rf_class <- as.numeric(pred_rf_class)
pred_lr_class <- as.numeric(pred_lr_class)
pred_knn_class <- as.numeric(pred_knn_class)
# calculate AUC-ROC
roc_dt <- roc(testing$is_fraud, pred_dt_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
roc_rf <- roc(testing$is_fraud, pred_rf_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
roc_lr <-roc(testing$is_fraud, pred_lr_proba, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Warning in roc.default(testing$is_fraud, pred_lr_proba, positive = "Fraud"):
## Deprecated use a matrix as predictor. Unexpected results may be produced,
## please pass a numeric vector.
## Setting direction: controls < cases
roc_knn <-roc(testing$is_fraud, pred_knn_class, positive="Fraud")
## Setting levels: control = Not_Fraud, case = Fraud
## Setting direction: controls < cases
# Plotting both ROC curves on the same plot
auc_labels <- c(paste0("DT (AUC=", round(roc_dt$auc,4), ")"),
paste0("RF (AUC=", round(roc_rf$auc, 4), ")"),
paste0("LR (AUC=", round(roc_lr$auc,4), ")"),
paste0("KNN (AUC=", round(roc_knn$auc, 4), ")"))
ggroc(list(DT= roc_dt, RF = roc_rf, KNN = roc_knn,LR = roc_lr), legacy.axes=TRUE) +
ggtitle("Average ROC Curves") + scale_color_discrete(labels=auc_labels)
# combine all the metrics of different ML models into a table
result_df<-data.frame(
Model_name=c('DECISION TREE','RANDOMFOREST','LOGISTIC REGRESSION','KNN'),
Accuracy=c(cm_dt$overall['Accuracy'],cm_rf$overall['Accuracy'],
cm_lr$overall['Accuracy'],cm_knn$overall['Accuracy']
),
F1_score=c(cm_dt$byClass['F1'],cm_rf$byClass['F1'],
cm_lr$byClass['F1'],cm_knn$byClass['F1']
),
Recall = c(cm_dt$byClass['Recall'],cm_rf$byClass['Recall'],
cm_lr$byClass['Recall'],cm_knn$byClass['Recall']
),
Sensitivity=c(cm_dt$byClass['Sensitivity'],cm_rf$byClass['Sensitivity'],
cm_lr$byClass['Sensitivity'],cm_knn$byClass['Sensitivity']
)
)
# View(result_df)
print(result_df)
## Model_name Accuracy F1_score Recall Sensitivity
## 1 DECISION TREE 0.9130290 0.9110126 0.8990579 0.8990579
## 2 RANDOMFOREST 0.9350217 0.9359606 0.9589502 0.9589502
## 3 LOGISTIC REGRESSION 0.8157281 0.8021467 0.7543742 0.7543742
## 4 KNN 0.8827058 0.8800273 0.8521452 0.8521452
#overall rating
ggplot(result_df,aes(x=Model_name,y=Accuracy, fill=Accuracy)) + geom_bar(stat="identity")