Every year, fraudulent credit card transactions result in billions of dollars in losses. The key to minimizing these losses is the development of effective fraud detection algorithms, and increasingly, these algorithms depend on cutting-edge machine learning methods to assist fraud investigators. However, designing fraud detection algorithms is particularly challenging due to the non-stationary distribution of the data, the extreme class imbalance, and the continuous streams of transactions. Moreover, publicly available data is scarce due to confidentiality concerns, which leaves many questions unanswered regarding how to approach this problem.
Over 127 million adults in America, nearly half of the adult population, have encountered fraudulent charges on their credit or debit cards. This highlights the widespread nature of card fraud and its impact on individuals.
More than one in three people who use credit or debit cards have experienced card fraud multiple times. This indicates that card fraud is not an isolated incident but a recurring problem for a significant portion of card users.
The average fraudulent charge on American credit and debit cards amounts to $62 per transaction. This translates to an estimated total of approximately $8 billion in attempted fraudulent transactions. This staggering amount underscores the financial impact and magnitude of card fraud.
Only around 40% of cardholders have activated email or text notifications from their banks or credit card issuers. This means that a large percentage of cardholders may not receive immediate alerts about potentially fraudulent activities on their cards.
Among the victims who have enabled notifications, only 19% had to take further action to reverse fraudulent charges. In contrast, approximately 81% of victims who did not have these warnings had to undertake additional measures. This indicates the effectiveness of timely notifications in minimizing the potential consequences of fraudulent charges.
library(vembedr)
embed_youtube("c-DxF1XVATw")
To find patterns of fraud, credit card fraud detection (CCFD) needs to analyze vast amounts of transaction data. Due to the large data volumes and the constantly evolving tactics of fraudsters, human investigators are unable to effectively address this issue. Over the past decade, machine learning (ML) methods have become an increasingly important component of CCFD as they enable searching and detecting patterns in extensive data sets. It has been demonstrated that ML algorithms can significantly enhance the effectiveness of fraud detection systems and aid fraud investigators in identifying fraudulent transactions.
The model with the highest predicted fraud detection performance on a given set of transactions is considered the optimal model for a fraud detection system. By leveraging historical credit card transaction data, we attempted to estimate and predict future fraudulent transactions. If fraudulent activity can be forecast, it will help fraud investigators propose better policies for real-world regulation.
Data source: ULB’s Credit Card Fraud Detection dataset on Kaggle: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. There are 284,807 observations and 31 columns: 1 response variable and 30 predictor variables. The 30 predictors are numerical, while the response is binary. The response variable, “Class,” takes the value 1 in cases of fraud and 0 otherwise.
#read data
raw_data <- read.csv("creditcard.csv")
head(raw_data)
## Time V1 V2 V3 V4 V5 V6
## 1 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 2 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 3 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 4 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 5 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 6 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 1 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 2 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 3 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 4 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 5 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 6 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 1 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 2 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 3 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 4 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 5 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 6 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23 V24
## 1 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391 0.06692807
## 2 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802 -0.33984648
## 3 -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226 -0.68928096
## 4 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052 -1.17557533
## 5 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808 0.14126698
## 6 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767 -0.37142658
## V25 V26 V27 V28 Amount Class
## 1 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 2 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 3 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 4 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 5 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 6 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
summary(raw_data)
## Time V1 V2 V3
## Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
## Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
## Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:139320 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
## Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
## V4 V5 V6 V7
## Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
## 1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
## Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
## Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
## V8 V9 V10 V11
## Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14 V15
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
## Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
## V16 V17 V18
## Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
## 1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
## Median : 0.06641 Median : -0.06568 Median :-0.003636
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
## Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
## V19 V20 V21
## Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
## 1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
## Median : 0.003735 Median : -0.06248 Median : -0.02945
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
## Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
## V22 V23 V24
## Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
## 1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
## Median : 0.006782 Median : -0.01119 Median : 0.04098
## Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
## Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
## V25 V26 V27
## Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
## 1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
## Median : 0.01659 Median :-0.05214 Median : 0.001342
## Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
## Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
## V28 Amount Class
## Min. :-15.43008 Min. : 0.00 Min. :0.000000
## 1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 0.01124 Median : 22.00 Median :0.000000
## Mean : 0.00000 Mean : 88.35 Mean :0.001728
## 3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. : 33.84781 Max. :25691.16 Max. :1.000000
# Check data Types
glimpse(raw_data)
## Rows: 284,807
## Columns: 31
## $ Time <dbl> 0, 0, 1, 1, 2, 2, 4, 7, 7, 9, 10, 10, 10, 11, 12, 12, 12, 13, 1…
## $ V1 <dbl> -1.3598071, 1.1918571, -1.3583541, -0.9662717, -1.1582331, -0.4…
## $ V2 <dbl> -0.07278117, 0.26615071, -1.34016307, -0.18522601, 0.87773675, …
## $ V3 <dbl> 2.53634674, 0.16648011, 1.77320934, 1.79299334, 1.54871785, 1.1…
## $ V4 <dbl> 1.37815522, 0.44815408, 0.37977959, -0.86329128, 0.40303393, -0…
## $ V5 <dbl> -0.33832077, 0.06001765, -0.50319813, -0.01030888, -0.40719338,…
## $ V6 <dbl> 0.46238778, -0.08236081, 1.80049938, 1.24720317, 0.09592146, -0…
## $ V7 <dbl> 0.239598554, -0.078802983, 0.791460956, 0.237608940, 0.59294074…
## $ V8 <dbl> 0.098697901, 0.085101655, 0.247675787, 0.377435875, -0.27053267…
## $ V9 <dbl> 0.3637870, -0.2554251, -1.5146543, -1.3870241, 0.8177393, -0.56…
## $ V10 <dbl> 0.09079417, -0.16697441, 0.20764287, -0.05495192, 0.75307443, -…
## $ V11 <dbl> -0.55159953, 1.61272666, 0.62450146, -0.22648726, -0.82284288, …
## $ V12 <dbl> -0.61780086, 1.06523531, 0.06608369, 0.17822823, 0.53819555, 0.…
## $ V13 <dbl> -0.99138985, 0.48909502, 0.71729273, 0.50775687, 1.34585159, -0…
## $ V14 <dbl> -0.31116935, -0.14377230, -0.16594592, -0.28792375, -1.11966983…
## $ V15 <dbl> 1.468176972, 0.635558093, 2.345864949, -0.631418118, 0.17512113…
## $ V16 <dbl> -0.47040053, 0.46391704, -2.89008319, -1.05964725, -0.45144918,…
## $ V17 <dbl> 0.207971242, -0.114804663, 1.109969379, -0.684092786, -0.237033…
## $ V18 <dbl> 0.02579058, -0.18336127, -0.12135931, 1.96577500, -0.03819479, …
## $ V19 <dbl> 0.40399296, -0.14578304, -2.26185710, -1.23262197, 0.80348692, …
## $ V20 <dbl> 0.25141210, -0.06908314, 0.52497973, -0.20803778, 0.40854236, 0…
## $ V21 <dbl> -0.018306778, -0.225775248, 0.247998153, -0.108300452, -0.00943…
## $ V22 <dbl> 0.277837576, -0.638671953, 0.771679402, 0.005273597, 0.79827849…
## $ V23 <dbl> -0.110473910, 0.101288021, 0.909412262, -0.190320519, -0.137458…
## $ V24 <dbl> 0.06692807, -0.33984648, -0.68928096, -1.17557533, 0.14126698, …
## $ V25 <dbl> 0.12853936, 0.16717040, -0.32764183, 0.64737603, -0.20600959, -…
## $ V26 <dbl> -0.18911484, 0.12589453, -0.13909657, -0.22192884, 0.50229222, …
## $ V27 <dbl> 0.133558377, -0.008983099, -0.055352794, 0.062722849, 0.2194222…
## $ V28 <dbl> -0.021053053, 0.014724169, -0.059751841, 0.061457629, 0.2151531…
## $ Amount <dbl> 149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 4.99, 40.80, 93.20, …
## $ Class <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Let’s now determine whether our response variable, Class, is balanced. If not, we must address the imbalance before modeling.
table(raw_data$Class)
##
## 0 1
## 284315 492
We can see that the response variable Class is highly unbalanced: observations in the “0” (non-fraud) class are far more frequent than those in the “1” (fraud) class. We need to address this imbalance; otherwise it will have a significant impact on our prediction models.
# Displaying the class distribution before downsampling.
cat("Distribution before downsampling:\n")
## Distribution before downsampling:
table(raw_data$Class)
##
## 0 1
## 284315 492
# Determining the number of samples to be retained from the minority class.
n_minority <- sum(raw_data$Class == "1") # Number of samples in the minority class
# Obtaining the indices of samples in the majority class
majority_indices <- which(raw_data$Class == "0")
# Performing random sampling on the indices of the majority class.
downsampled_indices <- sample(majority_indices, n_minority)
# Combining the downsampled indices of the majority class with the indices of the minority class
downsampled_indices <- c(downsampled_indices, which(raw_data$Class == "1"))
# Creating a new dataset that has been downsampled based on the combined indices of the majority and minority classes
downsampled_data <- raw_data[downsampled_indices, ]
# Displaying the class distribution after downsampling.
cat("\nDistribution after downsampling:\n")
##
## Distribution after downsampling:
table(downsampled_data$Class)
##
## 0 1
## 492 492
# Saving the downsampled data to a CSV file.
write.csv(downsampled_data, "creditcard_downsampled.csv", row.names = FALSE)
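The same balancing could also be done with the ovun.sample() function from the ROSE package instead of sampling indices by hand; a minimal sketch, assuming the ROSE package is installed:
# Sketch only: undersample the majority class with ROSE::ovun.sample()
# N = 2 * n_minority keeps all 492 frauds plus an equal number of non-frauds
library(ROSE)
balanced <- ovun.sample(Class ~ ., data = raw_data,
                        method = "under", N = 2 * n_minority, seed = 2022)$data
table(balanced$Class)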
creditc <- downsampled_data
# Convert class to factor
creditc <- creditc %>%
mutate(Class = factor(Class, levels = c("1", "0")))
#Summary
summary(creditc$Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.365 16.835 98.494 99.990 2125.870
var(creditc$Amount)
## [1] 44173.23
# show how many observations and variables in the new dataset
dim(creditc)
## [1] 984 31
# Standardize the Amount variable (note: scale() returns a one-column matrix)
creditc$Amount <- scale(creditc$Amount)
head(creditc)
## Time V1 V2 V3 V4 V5
## 119484 75465 -0.5589923 0.8272518 1.7469952 0.8799178436 -0.1182000
## 122598 76620 1.2953449 0.2705230 0.3140776 0.4996269130 -0.1959865
## 69684 53533 -1.6836962 1.5372563 1.0837728 -0.0009465908 -0.9655698
## 79030 57829 -0.3386080 1.1988628 1.3099733 0.0519260700 0.1052201
## 174498 121936 2.2957934 -1.6682370 -1.1699644 -1.8603757992 -1.0229363
## 197576 132080 -0.4573441 0.8420100 0.1848975 -0.5783188115 0.8285718
## V6 V7 V8 V9 V10 V11
## 119484 -0.09072459 1.38037720 -0.3346847 -0.265729055 -0.34690947 0.01592209
## 122598 -0.69668714 0.08350280 -0.2005205 0.007864602 -0.07889945 -0.58105940
## 69684 -0.46786851 0.09802389 0.1481253 0.114004314 1.37552162 1.80572614
## 79030 -0.95631832 0.77787430 -0.1245373 -0.531239701 -0.53884705 -0.06277031
## 174498 0.23780798 -1.49025840 0.1367276 -0.930216036 1.76372489 -0.80898522
## 197576 -0.20920733 0.73169645 0.1255633 -0.031736309 -0.97093087 0.52759036
## V12 V13 V14 V15 V16 V17
## 119484 0.6686735 0.4527749 -0.41002045 -0.3777227 -1.0764069 0.30759798
## 122598 0.4553171 0.8731735 0.11576819 1.0166372 0.4260015 -0.63098867
## 69684 1.0245117 0.7799514 -0.02593920 1.2492190 -0.2757981 0.04723097
## 79030 0.3960898 1.0162482 -0.60831904 0.7956086 0.3420571 -0.04473619
## 174498 -1.4579160 -1.1740374 -0.06673544 -0.4278678 0.1817296 -0.14832796
## 197576 -0.4110803 -1.9819051 -0.78744287 -1.3104270 0.1263313 0.65337535
## V18 V19 V20 V21 V22 V23
## 119484 -1.0941496 -0.40821083 -0.03884329 -0.02325090 0.3272196 -0.08957233
## 122598 -0.4375516 0.06558043 -0.04657746 -0.28466544 -0.7976893 0.06908271
## 69684 0.2368222 1.92076336 0.61810983 -0.07620336 0.1345966 0.07812739
## 79030 -0.2470371 -0.09546839 0.17185667 -0.25370541 -0.6164204 -0.02545672
## 174498 1.0837697 0.52363465 -0.47679683 -0.18918643 -0.1479315 0.09402840
## 197576 0.7449706 0.12038828 -0.20722138 0.12888272 0.4086537 -0.31133035
## V24 V25 V26 V27 V28 Amount
## 119484 0.6305936 0.1386869 -0.42697994 -0.15739459 -0.18903741 0.01092357
## 122598 -0.1070275 0.2997970 0.12389192 -0.02666722 0.01501109 -0.45921009
## 69684 0.6079298 -0.5731455 1.02604164 -0.06368674 0.18862975 -0.31161828
## 79030 0.3532177 -0.1294718 0.06916653 0.25142047 0.10013948 -0.45021755
## 174498 -1.5135345 -0.2104134 -0.10060708 0.01010621 -0.07060183 -0.33160170
## 197576 0.6372314 -0.1241656 0.54684579 -0.04533557 0.15203719 -0.44041616
## Class
## 119484 0
## 122598 0
## 69684 0
## 79030 0
## 174498 0
## 197576 0
We have completed the data cleaning process.
The data was split with stratified sampling by Class into a 70% training set and a 30% testing set.
set.seed(2022)
creditc_split <- initial_split(creditc, prop = 0.70, strata = Class)
# Training Dataset
creditc_train <- training(creditc_split)
# Testing Dataset
creditc_test <- testing(creditc_split)
# check dimension
dim(creditc_train)
## [1] 688 31
dim(creditc_test)
## [1] 296 31
After downsampling and the stratified split, the table below shows that the two classes in the training set are balanced.
table(creditc_train$Class)
##
## 1 0
## 344 344
## Predictor Variable `Amount` Analysis
creditc_train$Amount %>% summary()
## V1
## Min. :-0.468631
## 1st Qu.:-0.461506
## Median :-0.375874
## Mean : 0.030874
## 3rd Qu.: 0.007117
## Max. : 8.141747
creditc_train %>% ggplot(aes(Amount)) +
geom_histogram(bins=30) +
scale_x_log10() +
labs(
x = "Dollar Amount (Log Scale)", y = "Frequency (Count)",
title= "Distribution of Transaction Amount (Log Scaled)"
)
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 504 rows containing non-finite values (`stat_bin()`).
(Figure: Distribution of Transaction Amount (Log Scaled))
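The warnings above arise because Amount was standardized earlier, so it contains zero and negative values that cannot be placed on a log scale. One way to avoid them, sketched below under the assumption that we plot the un-scaled dollar amounts still held in downsampled_data, is to log-scale the raw amounts with a small offset:
# Sketch only: log-scale histogram of the raw dollar amounts
# (the +1 offset handles zero-dollar transactions)
downsampled_data %>%
  ggplot(aes(x = Amount + 1)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(
    x = "Dollar Amount + 1 (Log Scale)", y = "Frequency (Count)",
    title = "Distribution of Transaction Amount (Log Scaled)"
  )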
Because predictors V1 through V28 have already been transformed (they are anonymized principal components), inspecting their outliers offers little interpretable insight. Therefore, as Amount is the only raw numerical attribute, we will focus on it.
creditc_train %>% ggplot(aes(x=Amount)) +
geom_boxplot() +
labs(x = "Amount ($USD)", title= "Distribution of Transaction Amount")
The boxplot above shows a significant number of outliers on the higher end of the distribution, corresponding to transactions involving large dollar amounts in the thousands. When developing the predictive models, we should consider how this skewed distribution affects data transformation choices, or favor models that are robust to such features.
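If we wanted to build such a transformation into the modeling pipeline itself, one option is a log step in a recipes preprocessing recipe. A minimal sketch, assuming it is applied to data whose Amount column still holds raw dollar values (such as downsampled_data), i.e. before the scale() standardization above:
# Sketch only: log-transform the raw dollar amount before normalizing the predictors
# (offset = 1 handles zero-dollar transactions; assumes raw, un-scaled Amount)
library(recipes)
log_amount_recipe <- recipe(Class ~ ., data = downsampled_data) %>%
  step_log(Amount, offset = 1) %>%
  step_normalize(all_numeric_predictors())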
To analyze the variable Time, we will examine transaction times to look for any abnormalities. We will create a scatterplot using only the fraudulent transactions.
## Predictor Variable `Time` Analysis
## Is there any tendency in when fraud occurred?
# Splitting data by fraud class
CC_no_fraud <- creditc_train %>% filter(Class == 0)
CC_fraud <- creditc_train %>% filter(Class == 1)
# Scatterplot
CC_fraud %>% ggplot(aes(x=Time, y=Amount)) +
geom_point() +
labs(
y = "Amount ($)",
x = "Time (s)",
title= "Fraudulent Transactions Across Time"
)
(Figure: Fraudulent Transactions Across Time)
There does not appear to be any clustering structure over time in the graph above. Therefore, we conclude that fraud occurred relatively randomly over time.
The goal of model selection is to choose the model that will produce the best predictions on upcoming data. The model with the highest predicted fraud detection performance on the next block of transactions is the optimal model for a fraud detection system.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.4 ✔ recipes 1.0.6
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dbplyr::ident() masks dplyr::ident()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ dbplyr::sql() masks dplyr::sql()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Search for functions across packages at https://www.tidymodels.org/find/
# Define the recipe for data preprocessing
creditc_recipe <- recipe(Class ~ ., data = creditc_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
# Define the logistic regression model
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Create a workflow with the logistic regression model and recipe
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(creditc_recipe)
# Fit the model using the workflow and training data
log_fit <- fit(log_wkflow, data = creditc_train)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Perform predictions on the test data
log_test <- predict(log_fit, new_data = creditc_test) %>%
bind_cols(creditc_test %>% select(Class))
# Calculate accuracy
accuracy(log_test, truth = Class, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.926
Based on the table, we can see that the Logistic Regression model performed well, with an accuracy of 0.9256757.
library(yardstick)
library(ggplot2)
log_test %>%
conf_mat(truth = Class, estimate = .pred_class) %>%
autoplot(type = "heatmap")
#### ROC
# Generate predictions and calculate AUC
log_predictions <- predict(log_fit, creditc_test, type = "prob")
log_test <- tibble::add_column(creditc_test, .pred_1 = log_predictions$.pred_1)
# Plot the ROC curve
log_test %>%
roc_curve(Class, .pred_1) %>%
autoplot()
# Generate predictions and calculate AUC
log_predictions <- predict(log_fit, creditc_test, type = "prob")
log_test <- tibble::add_column(creditc_test, .pred_1 = log_predictions$.pred_1)
# Calculate the AUC
log_test %>%
roc_auc(Class, .pred_1)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.975
The reliability of our model is also confirmed by the confusion matrix, the 0.9256757 accuracy, and the solid 0.975 ROC AUC. The Logistic Regression correctly predicted 274 of the 296 test observations in the matrix, and the ROC curve sits virtually at the top-left corner.
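Accuracy can be misleading for fraud data even after balancing, so it also helps to look at class-specific metrics. A minimal sketch using yardstick, assuming the fitted log_fit and creditc_test from above, computes precision, recall, and F1 for the fraud class:
# Sketch only: precision, recall, and F-measure for the logistic model on the test set
fraud_metrics <- metric_set(precision, recall, f_meas)
augment(log_fit, new_data = creditc_test) %>%
  fraud_metrics(truth = Class, estimate = .pred_class)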
Next, we apply the Nearest Neighbors model. We start by folding the training data, using k-fold cross-validation with k = 5.
creditc_fold <- vfold_cv(creditc_train, v = 5, strata = Class)
knn_model <- nearest_neighbor(neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
knn_workflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(creditc_recipe)
# set-up tuning grid
knn_params <- parameters(knn_model)
## Warning: `parameters.model_spec()` was deprecated in tune 0.1.6.9003.
## ℹ Please use `hardhat::extract_parameter_set_dials()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
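As the warning notes, parameters() on a model specification is deprecated; the suggested replacement is sketched below:
# Sketch: equivalent call using the replacement suggested by the warning above
knn_params <- hardhat::extract_parameter_set_dials(knn_model)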
# define grid
knn_grid <- grid_regular(knn_params, levels = 2)
knn_tune <- knn_workflow %>%
tune_grid(resamples = creditc_fold,
grid = knn_grid)
arrange(collect_metrics(knn_tune), desc(mean))
## # A tibble: 4 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 roc_auc binary 0.966 5 0.00959 Preprocessor1_Model2
## 2 15 accuracy binary 0.927 5 0.0167 Preprocessor1_Model2
## 3 1 accuracy binary 0.914 5 0.0205 Preprocessor1_Model1
## 4 1 roc_auc binary 0.914 5 0.0205 Preprocessor1_Model1
We use the best parameter to fit the final model.
best_comp <- select_best(knn_tune, metric = "roc_auc")
creditc_final <- finalize_workflow(knn_workflow, best_comp)
knn_fit <- fit(creditc_final,data = creditc_train)
augment(knn_fit, new_data = creditc_test) %>%
accuracy(truth = Class, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.916
We can use the confusion-matrix heat map to visualize the predictions.
augment(knn_fit, new_data = creditc_test) %>%
conf_mat(truth = Class, estimate = .pred_class) %>%
autoplot(type = "heatmap")
# Calculate AUC
augment(knn_fit, new_data = creditc_test) %>%
roc_auc(Class, .pred_1)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.947
We can see that the Nearest Neighbors model achieves an accuracy of 0.916 and a high ROC AUC of 0.947, correctly predicting 271 of the 296 test observations in the matrix.
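Before settling on a final model, the two fitted workflows can be compared side by side on the test set; a minimal sketch, assuming log_fit and knn_fit from the sections above:
# Sketch only: test-set ROC AUC for both fitted workflows in one table
bind_rows(
  augment(log_fit, new_data = creditc_test) %>%
    roc_auc(Class, .pred_1) %>%
    mutate(model = "Logistic Regression"),
  augment(knn_fit, new_data = creditc_test) %>%
    roc_auc(Class, .pred_1) %>%
    mutate(model = "Nearest Neighbors")
)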
Credit card fraud detection is a challenging problem that requires analyzing large amounts of transaction data to identify patterns of fraud. For this project, I trained two prediction models on the same forecasting task and compared the results to choose the final “best” model. I dealt with the imbalanced dataset using a sampling technique, since in credit card fraud data the proportion of fraudulent cases to total transactions is very small. Because I balanced the data before training the models, I can use both the confusion-matrix accuracy and the Area Under the ROC Curve (ROC AUC) to assess their predictions. Although the Nearest Neighbors model performed well, the Logistic Regression model yielded the highest accuracy of 0.926 with an ROC AUC of 0.975, so I will present the Logistic Regression as my final model. The hope is that in the near future, more accurate fraud detection systems can be developed to assist fraud investigators in detecting fraudulent transactions and proposing better policies for real-world regulation.