1. Introduction

The purpose of this project

Every year, fraudulent credit card transactions result in billions of dollars in losses. The key to minimizing these losses is the development of effective fraud detection algorithms, and increasingly, these algorithms depend on cutting-edge machine learning methods to assist fraud investigators. However, designing fraud detection algorithms is particularly challenging due to the non-stationary distributions of the data, the extremely imbalanced class distributions, and the continuous streams of transactions. Moreover, publicly available data is scarce due to confidentiality concerns, which leaves many questions unanswered regarding how to approach this problem.

Some facts you need to know about Credit Card Fraud

  • Over 127 million adults in America, which is nearly half of the population, have encountered fraudulent transactions on their credit or debit cards. This highlights the widespread nature of card fraud and its impact on individuals.

  • More than one in three people who use credit or debit cards have experienced card fraud multiple times. This indicates that card fraud is not an isolated incident but a recurring problem for a significant portion of card users.

  • The average fraudulent charge on American credit and debit cards amounts to $62 per transaction. This translates to an estimated total of approximately $8 billion in attempted fraudulent transactions. This staggering amount underscores the financial impact and magnitude of card fraud.

  • Only around 40% of cardholders have activated email or text notifications from their banks or credit card issuers. This means that a large percentage of cardholders may not receive immediate alerts about potentially fraudulent activities on their cards.

  • Among the victims who have enabled notifications, only 19% had to take further action to reverse fraudulent charges. In contrast, approximately 81% of victims who did not have these warnings had to undertake additional measures. This indicates the effectiveness of timely notifications in minimizing the potential consequences of fraudulent charges.

# Embed a related YouTube video
library(vembedr)
embed_youtube("c-DxF1XVATw")

Why can machine learning be useful for Credit Card Fraud Detection?

To find patterns of fraud, credit card fraud detection (CCFD) needs to analyze vast amounts of transaction data. Due to the large data volumes and the constantly evolving tactics of fraudsters, human investigators are unable to effectively address this issue. Over the past decade, machine learning (ML) methods have become an increasingly important component of CCFD as they enable searching and detecting patterns in extensive data sets. It has been demonstrated that ML algorithms can significantly enhance the effectiveness of fraud detection systems and aid fraud investigators in identifying fraudulent transactions.

The model that demonstrates the highest predicted performance in detecting fraud within a specific set of transactions is considered the optimal model for a fraud detection system. By leveraging historical credit card transaction data, we attempted to estimate and predict future fraudulent transactions. Consequently, if fraudulent activities can be forecasted, it would assist fraud investigators in better proposing policies for real-world regulations.

2. Dataset Overview

ULB’s dataset from Kaggle -> https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

The dataset contains credit card transactions made by European cardholders in September 2013. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. There are 284,807 observations and 31 columns: 1 response variable and 30 predictor variables. The 30 predictors are numerical, while the response variable, “Class,” is binary, taking the value 1 in cases of fraud and 0 otherwise.

Loading Data and Packages

# Load packages used throughout this report
library(tidyverse)   # dplyr, ggplot2, readr, etc.

#read data
raw_data <- read.csv("creditcard.csv")
head(raw_data)
##   Time         V1          V2        V3         V4          V5          V6
## 1    0 -1.3598071 -0.07278117 2.5363467  1.3781552 -0.33832077  0.46238778
## 2    0  1.1918571  0.26615071 0.1664801  0.4481541  0.06001765 -0.08236081
## 3    1 -1.3583541 -1.34016307 1.7732093  0.3797796 -0.50319813  1.80049938
## 4    1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888  1.24720317
## 5    2 -1.1582331  0.87773675 1.5487178  0.4030339 -0.40719338  0.09592146
## 6    2 -0.4259659  0.96052304 1.1411093 -0.1682521  0.42098688 -0.02972755
##            V7          V8         V9         V10        V11         V12
## 1  0.23959855  0.09869790  0.3637870  0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298  0.08510165 -0.2554251 -0.16697441  1.6127267  1.06523531
## 3  0.79146096  0.24767579 -1.5146543  0.20764287  0.6245015  0.06608369
## 4  0.23760894  0.37743587 -1.3870241 -0.05495192 -0.2264873  0.17822823
## 5  0.59294075 -0.27053268  0.8177393  0.75307443 -0.8228429  0.53819555
## 6  0.47620095  0.26031433 -0.5686714 -0.37140720  1.3412620  0.35989384
##          V13        V14        V15        V16         V17         V18
## 1 -0.9913898 -0.3111694  1.4681770 -0.4704005  0.20797124  0.02579058
## 2  0.4890950 -0.1437723  0.6355581  0.4639170 -0.11480466 -0.18336127
## 3  0.7172927 -0.1659459  2.3458649 -2.8900832  1.10996938 -0.12135931
## 4  0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279  1.96577500
## 5  1.3458516 -1.1196698  0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337  0.5176168  0.4017259 -0.05813282  0.06865315
##           V19         V20          V21          V22         V23         V24
## 1  0.40399296  0.25141210 -0.018306778  0.277837576 -0.11047391  0.06692807
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953  0.10128802 -0.33984648
## 3 -2.26185710  0.52497973  0.247998153  0.771679402  0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452  0.005273597 -0.19032052 -1.17557533
## 5  0.80348692  0.40854236 -0.009430697  0.798278495 -0.13745808  0.14126698
## 6 -0.03319379  0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
##          V25        V26          V27         V28 Amount Class
## 1  0.1285394 -0.1891148  0.133558377 -0.02105305 149.62     0
## 2  0.1671704  0.1258945 -0.008983099  0.01472417   2.69     0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66     0
## 4  0.6473760 -0.2219288  0.062722849  0.06145763 123.50     0
## 5 -0.2060096  0.5022922  0.219422230  0.21515315  69.99     0
## 6 -0.2327938  0.1059148  0.253844225  0.08108026   3.67     0
summary(raw_data)
##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139320   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount             Class         
##  Min.   :-15.43008   Min.   :    0.00   Min.   :0.000000  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1st Qu.:0.000000  
##  Median :  0.01124   Median :   22.00   Median :0.000000  
##  Mean   :  0.00000   Mean   :   88.35   Mean   :0.001728  
##  3rd Qu.:  0.07828   3rd Qu.:   77.17   3rd Qu.:0.000000  
##  Max.   : 33.84781   Max.   :25691.16   Max.   :1.000000
# Check data Types
glimpse(raw_data)
## Rows: 284,807
## Columns: 31
## $ Time   <dbl> 0, 0, 1, 1, 2, 2, 4, 7, 7, 9, 10, 10, 10, 11, 12, 12, 12, 13, 1…
## $ V1     <dbl> -1.3598071, 1.1918571, -1.3583541, -0.9662717, -1.1582331, -0.4…
## $ V2     <dbl> -0.07278117, 0.26615071, -1.34016307, -0.18522601, 0.87773675, …
## $ V3     <dbl> 2.53634674, 0.16648011, 1.77320934, 1.79299334, 1.54871785, 1.1…
## $ V4     <dbl> 1.37815522, 0.44815408, 0.37977959, -0.86329128, 0.40303393, -0…
## $ V5     <dbl> -0.33832077, 0.06001765, -0.50319813, -0.01030888, -0.40719338,…
## $ V6     <dbl> 0.46238778, -0.08236081, 1.80049938, 1.24720317, 0.09592146, -0…
## $ V7     <dbl> 0.239598554, -0.078802983, 0.791460956, 0.237608940, 0.59294074…
## $ V8     <dbl> 0.098697901, 0.085101655, 0.247675787, 0.377435875, -0.27053267…
## $ V9     <dbl> 0.3637870, -0.2554251, -1.5146543, -1.3870241, 0.8177393, -0.56…
## $ V10    <dbl> 0.09079417, -0.16697441, 0.20764287, -0.05495192, 0.75307443, -…
## $ V11    <dbl> -0.55159953, 1.61272666, 0.62450146, -0.22648726, -0.82284288, …
## $ V12    <dbl> -0.61780086, 1.06523531, 0.06608369, 0.17822823, 0.53819555, 0.…
## $ V13    <dbl> -0.99138985, 0.48909502, 0.71729273, 0.50775687, 1.34585159, -0…
## $ V14    <dbl> -0.31116935, -0.14377230, -0.16594592, -0.28792375, -1.11966983…
## $ V15    <dbl> 1.468176972, 0.635558093, 2.345864949, -0.631418118, 0.17512113…
## $ V16    <dbl> -0.47040053, 0.46391704, -2.89008319, -1.05964725, -0.45144918,…
## $ V17    <dbl> 0.207971242, -0.114804663, 1.109969379, -0.684092786, -0.237033…
## $ V18    <dbl> 0.02579058, -0.18336127, -0.12135931, 1.96577500, -0.03819479, …
## $ V19    <dbl> 0.40399296, -0.14578304, -2.26185710, -1.23262197, 0.80348692, …
## $ V20    <dbl> 0.25141210, -0.06908314, 0.52497973, -0.20803778, 0.40854236, 0…
## $ V21    <dbl> -0.018306778, -0.225775248, 0.247998153, -0.108300452, -0.00943…
## $ V22    <dbl> 0.277837576, -0.638671953, 0.771679402, 0.005273597, 0.79827849…
## $ V23    <dbl> -0.110473910, 0.101288021, 0.909412262, -0.190320519, -0.137458…
## $ V24    <dbl> 0.06692807, -0.33984648, -0.68928096, -1.17557533, 0.14126698, …
## $ V25    <dbl> 0.12853936, 0.16717040, -0.32764183, 0.64737603, -0.20600959, -…
## $ V26    <dbl> -0.18911484, 0.12589453, -0.13909657, -0.22192884, 0.50229222, …
## $ V27    <dbl> 0.133558377, -0.008983099, -0.055352794, 0.062722849, 0.2194222…
## $ V28    <dbl> -0.021053053, 0.014724169, -0.059751841, 0.061457629, 0.2151531…
## $ Amount <dbl> 149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 4.99, 40.80, 93.20, …
## $ Class  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

3. Data Cleansing

Let’s now determine whether our response variable Class is balanced. If it is not, we must address the imbalance.

table(raw_data$Class)
## 
##      0      1 
## 284315    492

We can tell that our response variable Class is highly unbalanced: observations in the “0” (legitimate) class are far more frequent than those in the “1” (fraud) class. We need to address this imbalance, otherwise it will have a significant impact on our prediction models. Below, we downsample the majority class so that it matches the number of fraud cases.

# Displaying the class distribution before downsampling.
cat("Distribution before downsampling:\n")
## Distribution before downsampling:
table(raw_data$Class)
## 
##      0      1 
## 284315    492
# Determining the number of samples to be retained from the minority class.
n_minority <- sum(raw_data$Class == "1")  # Number of samples in the minority class

# Obtaining the indices of samples in the majority class
majority_indices <- which(raw_data$Class == "0")

# Performing random sampling on the indices of the majority class
# (note: without a fixed seed, the exact rows selected will differ between runs)
downsampled_indices <- sample(majority_indices, n_minority)

# Combining the downsampled indices of the majority class with the indices of the minority class
downsampled_indices <- c(downsampled_indices, which(raw_data$Class == "1"))

# Creating a new dataset that has been downsampled based on the combined indices of the majority and minority classes
downsampled_data <- raw_data[downsampled_indices, ]

# Displaying the class distribution after downsampling.
cat("\nDistribution after downsampling:\n")
## 
## Distribution after downsampling:
table(downsampled_data$Class)
## 
##   0   1 
## 492 492
# Saving the downsampled data to a CSV file.
write.csv(downsampled_data, "creditcard_downsampled.csv", row.names = FALSE)

creditc <- downsampled_data
# Convert class to factor
creditc <- creditc %>%
  mutate(Class = factor(Class, levels = c("1", "0"))) 
#Summary
summary(creditc$Amount)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    1.365   16.835   98.494   99.990 2125.870
var(creditc$Amount)
## [1] 44173.23
# show how many observations and variables in the new dataset
dim(creditc) 
## [1] 984  31
# Standardize the Amount column (scale() centers to mean 0 and scales to sd 1)
creditc$Amount <- scale(creditc$Amount)
head(creditc)
##          Time         V1         V2         V3            V4         V5
## 119484  75465 -0.5589923  0.8272518  1.7469952  0.8799178436 -0.1182000
## 122598  76620  1.2953449  0.2705230  0.3140776  0.4996269130 -0.1959865
## 69684   53533 -1.6836962  1.5372563  1.0837728 -0.0009465908 -0.9655698
## 79030   57829 -0.3386080  1.1988628  1.3099733  0.0519260700  0.1052201
## 174498 121936  2.2957934 -1.6682370 -1.1699644 -1.8603757992 -1.0229363
## 197576 132080 -0.4573441  0.8420100  0.1848975 -0.5783188115  0.8285718
##                 V6          V7         V8           V9         V10         V11
## 119484 -0.09072459  1.38037720 -0.3346847 -0.265729055 -0.34690947  0.01592209
## 122598 -0.69668714  0.08350280 -0.2005205  0.007864602 -0.07889945 -0.58105940
## 69684  -0.46786851  0.09802389  0.1481253  0.114004314  1.37552162  1.80572614
## 79030  -0.95631832  0.77787430 -0.1245373 -0.531239701 -0.53884705 -0.06277031
## 174498  0.23780798 -1.49025840  0.1367276 -0.930216036  1.76372489 -0.80898522
## 197576 -0.20920733  0.73169645  0.1255633 -0.031736309 -0.97093087  0.52759036
##               V12        V13         V14        V15        V16         V17
## 119484  0.6686735  0.4527749 -0.41002045 -0.3777227 -1.0764069  0.30759798
## 122598  0.4553171  0.8731735  0.11576819  1.0166372  0.4260015 -0.63098867
## 69684   1.0245117  0.7799514 -0.02593920  1.2492190 -0.2757981  0.04723097
## 79030   0.3960898  1.0162482 -0.60831904  0.7956086  0.3420571 -0.04473619
## 174498 -1.4579160 -1.1740374 -0.06673544 -0.4278678  0.1817296 -0.14832796
## 197576 -0.4110803 -1.9819051 -0.78744287 -1.3104270  0.1263313  0.65337535
##               V18         V19         V20         V21        V22         V23
## 119484 -1.0941496 -0.40821083 -0.03884329 -0.02325090  0.3272196 -0.08957233
## 122598 -0.4375516  0.06558043 -0.04657746 -0.28466544 -0.7976893  0.06908271
## 69684   0.2368222  1.92076336  0.61810983 -0.07620336  0.1345966  0.07812739
## 79030  -0.2470371 -0.09546839  0.17185667 -0.25370541 -0.6164204 -0.02545672
## 174498  1.0837697  0.52363465 -0.47679683 -0.18918643 -0.1479315  0.09402840
## 197576  0.7449706  0.12038828 -0.20722138  0.12888272  0.4086537 -0.31133035
##               V24        V25         V26         V27         V28      Amount
## 119484  0.6305936  0.1386869 -0.42697994 -0.15739459 -0.18903741  0.01092357
## 122598 -0.1070275  0.2997970  0.12389192 -0.02666722  0.01501109 -0.45921009
## 69684   0.6079298 -0.5731455  1.02604164 -0.06368674  0.18862975 -0.31161828
## 79030   0.3532177 -0.1294718  0.06916653  0.25142047  0.10013948 -0.45021755
## 174498 -1.5135345 -0.2104134 -0.10060708  0.01010621 -0.07060183 -0.33160170
## 197576  0.6372314 -0.1241656  0.54684579 -0.04533557  0.15203719 -0.44041616
##        Class
## 119484     0
## 122598     0
## 69684      0
## 79030      0
## 174498     0
## 197576     0

This completes the data cleaning process.

4. Data Split

The data was split using stratified sampling on Class into a 70% training set and a 30% testing set.

set.seed(2022)
creditc_split <- initial_split(creditc, prop = 0.70, strata = Class)
# Training Dataset
creditc_train <- training(creditc_split)
# Testing Dataset
creditc_test <- testing(creditc_split)

# check dimension
dim(creditc_train)
## [1] 688  31
dim(creditc_test)
## [1] 296  31

5. Data Exploration

After downsampling the majority class, we can see from the table below that the number of fraud and non-fraud cases in the training set is now balanced.

table(creditc_train$Class)
## 
##   1   0 
## 344 344

Predictor Variable Transformation and Analysis

## Predictor Variable `Amount` Analysis
creditc_train$Amount %>% summary()
##        V1           
##  Min.   :-0.468631  
##  1st Qu.:-0.461506  
##  Median :-0.375874  
##  Mean   : 0.030874  
##  3rd Qu.: 0.007117  
##  Max.   : 8.141747
creditc_train %>% ggplot(aes(Amount)) +
  geom_histogram(bins=30) +
  scale_x_log10() +
  labs(
  x = "Dollar Amount (Log Scale)", y = "Frequency (Count)",
  title= "Distribution of Transaction Amount (Log Scaled)"
 )
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 504 rows containing non-finite values (`stat_bin()`).

-> Predictor Variable Amount Analysis

Because the predictors V1, V2, …, V28 have already been transformed (they are principal components from a PCA of the original features), we expect few problematic outliers among their values. Therefore, as Amount is the only remaining raw numerical attribute, we shall focus on it.

creditc_train %>% ggplot(aes(x=Amount)) +
  geom_boxplot() +
  labs(x = "Amount ($USD)", title= "Distribution of Transaction Amount")

We can see a significant number of outliers on the higher end of the distribution from the boxplot above. It would signify transactions involving large amounts of money in thousands. When developing the predictive models, we would consider how this skewed distribution might affect data transformation or the choice of models that are resistant to such feature types.
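
Because Amount was standardized above, a log scale drops its non-positive values (hence the warnings in the earlier histogram). As a minimal sketch of the kind of transformation we might consider, the un-scaled dollar amounts kept in downsampled_data can be compressed with log1p, which is defined at zero:

# Sketch only (not used by the models below): log1p(x) = log(1 + x) compresses
# the heavy right tail of the raw dollar amounts and handles zero safely.
downsampled_data %>%
  mutate(log_amount = log1p(Amount)) %>%
  ggplot(aes(x = log_amount, fill = factor(Class))) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  labs(x = "log1p(Amount)", fill = "Class",
       title = "Transaction Amount After a log1p Transform")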

To analyze the Time variable, we will examine transaction times to look for any abnormalities. We will create a scatterplot using only the fraudulent transactions.

## Target Variable `time` Analysis
## Are there any tendency in time where fraud occurred?
# Splitting data by fraud class
CC_no_fraud <- creditc_train %>% filter(Class == 0)
CC_fraud <- creditc_train %>% filter(Class == 1)
# Scatterplot
CC_fraud %>% ggplot(aes(x=Time, y=Amount)) +
  geom_point() +
  labs(
  y = "Amount ($)", 
  x = "Time (s)",
  title= "Fraudulent Transactions Across Time"
 )

-> Predictor Variable Time Analysis

There does not seem to be any clustering structure over time in the graph above. Therefore, we would suppose that fraud occurs relatively randomly throughout time.
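
As a small complementary sketch, the Time densities of the two classes can be overlaid to double-check this impression of randomness over time:

# Sketch: compare when legitimate and fraudulent transactions occur by
# overlaying their Time densities on the training set.
creditc_train %>%
  ggplot(aes(x = Time, fill = Class)) +
  geom_density(alpha = 0.4) +
  labs(x = "Time (s)", fill = "Class",
       title = "Transaction Time Density by Class")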

6. Model Fitting

The goal of model selection is to choose the model that will produce the best predictions on upcoming data. For a fraud detection system, the optimal model is the one with the highest predicted fraud detection performance on the next block of transactions.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.4     ✔ recipes      1.0.6
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ randomForest::combine()  masks dplyr::combine()
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ dbplyr::ident()          masks dplyr::ident()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ caret::lift()            masks purrr::lift()
## ✖ randomForest::margin()   masks ggplot2::margin()
## ✖ rsample::permutations()  masks e1071::permutations()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ dbplyr::sql()            masks dplyr::sql()
## ✖ recipes::step()          masks stats::step()
## ✖ tune::tune()             masks parsnip::tune(), e1071::tune()
## • Search for functions across packages at https://www.tidymodels.org/find/
# Define the recipe for data preprocessing
creditc_recipe <- recipe(Class ~ ., data = creditc_train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors())
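
To see what this preprocessing actually produces, the recipe can be prepped and baked on the training set; this is a quick sketch for inspection only, since the workflows below handle the preprocessing automatically:

# Sketch: estimate the recipe's steps on the training data and inspect the
# resulting (normalized) predictors. Not required for model fitting.
creditc_recipe %>%
  prep(training = creditc_train) %>%
  bake(new_data = NULL) %>%
  glimpse()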

Logistic Regression Model

library(tidymodels)

# Define the logistic regression model
log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

# Create a workflow with the logistic regression model and recipe
log_wkflow <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(creditc_recipe)

# Fit the model using the workflow and training data
log_fit <- fit(log_wkflow, data = creditc_train)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Predicting on the testing data

# Perform predictions on the test data
log_test <- predict(log_fit, new_data = creditc_test) %>% 
  bind_cols(creditc_test %>% select(Class))
# Calculate accuracy
accuracy(log_test, truth = Class, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.926

Based on the table, we can see that the Logistic Regression model made strong predictions, with an accuracy of 0.9256757 on the testing set.
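
Accuracy alone can hide how many frauds slip through, which is what matters most to investigators. As a quick sketch, class-wise metrics can be computed from the same predictions with yardstick (with the Class levels ordered c("1", "0"), the fraud class "1" is treated as the event):

# Sketch: sensitivity (recall on fraud), specificity, and precision for the
# logistic regression predictions on the testing set.
fraud_metrics <- metric_set(sens, spec, precision)
log_test %>%
  fraud_metrics(truth = Class, estimate = .pred_class)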

Confusion matrix

library(yardstick)
library(ggplot2)

log_test %>%
  conf_mat(truth = Class, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

ROC

# Generate predicted class probabilities for the ROC curve
log_predictions <- predict(log_fit, creditc_test, type = "prob")
log_test <- tibble::add_column(creditc_test, .pred_1 = log_predictions$.pred_1)

# Plot the ROC curve
log_test %>%
  roc_curve(Class, .pred_1) %>%
  autoplot()

# Compute the area under the ROC curve
log_test %>%
  roc_auc(Class, .pred_1)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.975

The reliability of our model is also confirmed by the confusion matrix, the 0.9256757 accuracy, and the solid 0.975 ROC AUC. The Logistic Regression correctly predicted 274 of the 296 observations in the testing set, and the ROC curve sits virtually at the top-left corner.

Nearest Neighbors Model

Next, we apply the Nearest Neighbors model. We start by folding the training data, using k-fold cross-validation with k = 5.

creditc_fold <- vfold_cv(creditc_train, v = 5, strata = Class)

Set up

knn_model <- nearest_neighbor(neighbors = tune(),
            mode = "classification") %>% 
            set_engine("kknn")

knn_workflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(creditc_recipe)

# set-up tuning grid 
knn_params <- parameters(knn_model)
## Warning: `parameters.model_spec()` was deprecated in tune 0.1.6.9003.
## ℹ Please use `hardhat::extract_parameter_set_dials()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# define grid
knn_grid <- grid_regular(knn_params, levels = 2)

Tune the model

knn_tune <- knn_workflow %>% 
  tune_grid(resamples = creditc_fold, 
            grid = knn_grid)
arrange(collect_metrics(knn_tune),desc(mean))
## # A tibble: 4 × 7
##   neighbors .metric  .estimator  mean     n std_err .config             
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 roc_auc  binary     0.966     5 0.00959 Preprocessor1_Model2
## 2        15 accuracy binary     0.927     5 0.0167  Preprocessor1_Model2
## 3         1 accuracy binary     0.914     5 0.0205  Preprocessor1_Model1
## 4         1 roc_auc  binary     0.914     5 0.0205  Preprocessor1_Model1

Fit the K-Nearest Neighbors model

We use the best parameter value, selected by ROC AUC, to fit the final model.

best_comp <- select_best(knn_tune, metric = "roc_auc")
creditc_final <- finalize_workflow(knn_workflow, best_comp)
knn_fit <- fit(creditc_final,data = creditc_train)

augment(knn_fit, new_data = creditc_test) %>%
  accuracy(truth = Class, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.916

Confusion matrix

We can use the confusion matrix heat map to see the predictions clearly.

augment(knn_fit, new_data = creditc_test) %>%
  conf_mat(truth = Class, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

# Calculate AUC
augment(knn_fit, new_data = creditc_test) %>%
  roc_auc(Class, .pred_1)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.947

We can see that the Nearest Neighbors model achieves an accuracy of 0.916 and a high ROC AUC of 0.947, correctly predicting 271 of the 296 testing observations shown in the matrix.
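
Before concluding, the two fitted workflows can be scored on the same testing set and their metrics gathered into one table for a side-by-side comparison; this is a sketch, and the values should simply reproduce the numbers reported above:

# Sketch: collect test-set accuracy and ROC AUC for both models in one tibble.
log_aug <- augment(log_fit, new_data = creditc_test)
knn_aug <- augment(knn_fit, new_data = creditc_test)

bind_rows(
  log_aug %>% accuracy(truth = Class, estimate = .pred_class) %>% mutate(model = "Logistic Regression"),
  log_aug %>% roc_auc(Class, .pred_1) %>% mutate(model = "Logistic Regression"),
  knn_aug %>% accuracy(truth = Class, estimate = .pred_class) %>% mutate(model = "Nearest Neighbors"),
  knn_aug %>% roc_auc(Class, .pred_1) %>% mutate(model = "Nearest Neighbors")
) %>%
  select(model, .metric, .estimate)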

7. Conclusion

Credit card fraud detection is a challenging problem that requires analyzing large amounts of transaction data to identify patterns of fraud. For this project, I trained two prediction models on the same forecasting task and compared their results to select the final “best” model. Because the proportion of fraudulent cases in the raw credit card dataset is extremely small, I first addressed the class imbalance with a downsampling technique. Since the data were balanced before training, I evaluated the models with both confusion-matrix accuracy and the Area Under the ROC Curve (AUC); on the original, highly imbalanced data, the Area Under the Precision-Recall Curve would be a more informative summary. Although the Nearest Neighbors model performed well, the Logistic Regression model yielded the higher accuracy of 0.9257 with an ROC AUC of 0.975, so I present Logistic Regression as my final model. The hope is that, in the near future, more accurate fraud detection systems can be developed to assist fraud investigators in detecting fraudulent transactions and proposing better policies for real-world regulation.
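
For the original, heavily imbalanced data, the precision-recall view mentioned above can be computed from the same test-set predictions; a minimal sketch with yardstick:

# Sketch: area under the precision-recall curve for the logistic regression
# model; with Class levels c("1", "0"), the fraud class "1" is the event.
augment(log_fit, new_data = creditc_test) %>%
  pr_auc(Class, .pred_1)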

8. References