Description

Today, we are going to build an XGBoost model to detect credit card fraud. Our data contains transactions made by European cardholders with credit cards in September 2013. For confidentiality, the features were transformed with PCA; only the time and amount of each transaction are retained in their original form. The Class column indicates the fraud label: 0 means the transaction is not flagged as fraud and 1 means it is.

For more information about the dataset, visit: https://www.kaggle.com/mlg-ulb/creditcardfraud. The following is a brief summary from that page.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. 
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ dplyr   1.0.6
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
## ✓ broom        0.7.6      ✓ rsample      0.1.0 
## ✓ dials        0.0.9      ✓ tune         0.1.5 
## ✓ infer        0.5.4      ✓ workflows    0.2.2 
## ✓ modeldata    0.1.0      ✓ workflowsets 0.0.2 
## ✓ parsnip      0.1.5      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.16
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
creditcard <- read_csv("creditcard.csv") %>%
  mutate(Class = factor(
    Class,
    levels = c(0, 1),
    labels = c("non_fraud", "fraud")
  ))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
skimr::skim(creditcard)
Data summary

|                        |            |
|------------------------|------------|
| Name                   | creditcard |
| Number of rows         | 284807     |
| Number of columns      | 31         |
| Column type frequency: |            |
| factor                 | 1          |
| numeric                | 30         |
| Group variables        | None       |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---------------|-----------|---------------|---------|----------|------------|
| Class | 0 | 1 | FALSE | 2 | non: 284315, fra: 492 |

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 94813.86 47488.15 0.00 54201.50 84692.00 139320.50 172792.00 ▃▇▅▆▇
V1 0 1 0.00 1.96 -56.41 -0.92 0.02 1.32 2.45 ▁▁▁▁▇
V2 0 1 0.00 1.65 -72.72 -0.60 0.07 0.80 22.06 ▁▁▁▇▁
V3 0 1 0.00 1.52 -48.33 -0.89 0.18 1.03 9.38 ▁▁▁▁▇
V4 0 1 0.00 1.42 -5.68 -0.85 -0.02 0.74 16.88 ▂▇▁▁▁
V5 0 1 0.00 1.38 -113.74 -0.69 -0.05 0.61 34.80 ▁▁▁▇▁
V6 0 1 0.00 1.33 -26.16 -0.77 -0.27 0.40 73.30 ▁▇▁▁▁
V7 0 1 0.00 1.24 -43.56 -0.55 0.04 0.57 120.59 ▁▇▁▁▁
V8 0 1 0.00 1.19 -73.22 -0.21 0.02 0.33 20.01 ▁▁▁▇▁
V9 0 1 0.00 1.10 -13.43 -0.64 -0.05 0.60 15.59 ▁▁▇▁▁
V10 0 1 0.00 1.09 -24.59 -0.54 -0.09 0.45 23.75 ▁▁▇▁▁
V11 0 1 0.00 1.02 -4.80 -0.76 -0.03 0.74 12.02 ▁▇▁▁▁
V12 0 1 0.00 1.00 -18.68 -0.41 0.14 0.62 7.85 ▁▁▁▇▁
V13 0 1 0.00 1.00 -5.79 -0.65 -0.01 0.66 7.13 ▁▃▇▁▁
V14 0 1 0.00 0.96 -19.21 -0.43 0.05 0.49 10.53 ▁▁▁▇▁
V15 0 1 0.00 0.92 -4.50 -0.58 0.05 0.65 8.88 ▁▇▂▁▁
V16 0 1 0.00 0.88 -14.13 -0.47 0.07 0.52 17.32 ▁▁▇▁▁
V17 0 1 0.00 0.85 -25.16 -0.48 -0.07 0.40 9.25 ▁▁▁▇▁
V18 0 1 0.00 0.84 -9.50 -0.50 0.00 0.50 5.04 ▁▁▂▇▁
V19 0 1 0.00 0.81 -7.21 -0.46 0.00 0.46 5.59 ▁▁▇▂▁
V20 0 1 0.00 0.77 -54.50 -0.21 -0.06 0.13 39.42 ▁▁▇▁▁
V21 0 1 0.00 0.73 -34.83 -0.23 -0.03 0.19 27.20 ▁▁▇▁▁
V22 0 1 0.00 0.73 -10.93 -0.54 0.01 0.53 10.50 ▁▁▇▁▁
V23 0 1 0.00 0.62 -44.81 -0.16 -0.01 0.15 22.53 ▁▁▁▇▁
V24 0 1 0.00 0.61 -2.84 -0.35 0.04 0.44 4.58 ▁▇▆▁▁
V25 0 1 0.00 0.52 -10.30 -0.32 0.02 0.35 7.52 ▁▁▇▂▁
V26 0 1 0.00 0.48 -2.60 -0.33 -0.05 0.24 3.52 ▁▆▇▁▁
V27 0 1 0.00 0.40 -22.57 -0.07 0.00 0.09 31.61 ▁▁▇▁▁
V28 0 1 0.00 0.33 -15.43 -0.05 0.01 0.08 33.85 ▁▇▁▁▁
Amount 0 1 88.35 250.12 0.00 5.60 22.00 77.16 25691.16 ▇▁▁▁▁
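
One thing the summary makes obvious is how skewed Amount is (median 22, maximum 25,691.16). As an optional exploratory step that is not part of the modeling pipeline, we could compare the amount distributions of the two classes on a log-like scale:

creditcard %>%
  ggplot(aes(Amount)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(trans = "log1p") +   # log1p keeps the zero-amount transactions
  facet_wrap(~ Class, scales = "free_y")  # free y-axis because the classes are so unbalanced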

Build our XGBoost model

Split the data into training and testing sets and create bootstrap resamples

From the creditcard data frame, we split off a testing set and a training set, and then create resamples of the training data using the bootstrap method. We stratify by Class so that the rare fraud cases appear in the same proportion in every split.

set.seed(123)
credit_split <- initial_split(creditcard, strata = Class)
credit_test <- testing(credit_split)
credit_train <- training(credit_split)

set.seed(234)
bs_resample <- bootstraps(credit_train, strata = Class)
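
As a quick sanity check (an addition, not part of the original pipeline), we can verify that stratification kept the fraud rate roughly the same in both sets:

credit_train %>% count(Class) %>% mutate(prop = n / sum(n))
credit_test  %>% count(Class) %>% mutate(prop = n / sum(n))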

Build the preprocessing recipe and model

In the preprocessing step, we downsample the training data by Class, since the classes are highly unbalanced: only 0.172% of all transactions are fraud. Without rebalancing, a model can reach very high accuracy simply by predicting non_fraud for every transaction while missing almost all of the fraud, so we balance the classes before fitting. We also remove any zero-variance predictors.

xgboost_recipe <- 
  recipe(formula = Class ~ ., data = credit_train) %>% 
  themis::step_downsample(Class) %>% 
  step_zv(all_predictors())
## Registered S3 methods overwritten by 'themis':
##   method                  from   
##   bake.step_downsample    recipes
##   bake.step_upsample      recipes
##   prep.step_downsample    recipes
##   prep.step_upsample      recipes
##   tidy.step_downsample    recipes
##   tidy.step_upsample      recipes
##   tunable.step_downsample recipes
##   tunable.step_upsample   recipes
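To see what the downsampling step actually does, we can prep() the recipe and bake() it; this check is illustrative and not needed for the workflow below. Both classes should come out with the same count, equal to the number of fraud cases in the training set:

xgboost_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%  # new_data = NULL returns the processed training set
  count(Class)
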
xgboost_spec <- 
  boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 
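
We accept boost_tree()'s default hyperparameters here. To inspect exactly what parsnip will pass to the xgboost engine, you can print the translated call:

xgboost_spec %>% translate()

Parameters such as trees, tree_depth, and learn_rate could later be marked with tune() and optimized on the bootstrap resamples, but we keep the defaults in this post.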

Fit the model

This step fits the model on the bootstrap resamples, saving the predictions so we can inspect them later.

set.seed(8000)
fit_resam <- xgboost_workflow %>% 
  fit_resamples(
    resamples = bs_resample,
    metrics = metric_set(roc_auc, accuracy),
    control = control_resamples(save_pred = TRUE)
  )

Let’s see how the model performs across the resamples.

fit_resam %>% collect_metrics() %>% knitr::kable(digits = 3)
| .metric | .estimator | mean | n | std_err | .config |
|---------|------------|------|---|---------|---------|
| accuracy | binary | 0.975 | 25 | 0.001 | Preprocessor1_Model1 |
| roc_auc | binary | 0.975 | 25 | 0.001 | Preprocessor1_Model1 |
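
Because we saved the predictions, we can also draw one ROC curve per bootstrap resample; this is an optional extra. Note that yardstick treats the first factor level (non_fraud here) as the event by default, so we pass event_level = "second" to score the fraud class:

fit_resam %>%
  collect_predictions() %>%
  group_by(id) %>%                                          # one curve per resample
  roc_curve(Class, .pred_fraud, event_level = "second") %>%
  autoplot()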

Now we finally fit the model on the testing data, which was never used during training. Reading the confusion matrix below, about 97.4% of all transactions are classified correctly, and 115 of the 127 actual frauds are caught (roughly 90.6%), which is relatively high given how rare fraud is. The trade-off of downsampling is also visible: 1,816 legitimate transactions are flagged as fraud.

final_fit <- last_fit(xgboost_workflow, credit_split)
final_fit %>% 
  conf_mat_resampled() %>% 
  arrange(Truth, desc(Freq)) %>% 
  mutate(per = 100 * Freq / sum(Freq)) %>% 
  knitr::kable(digits = 1)
| Prediction | Truth | Freq | per |
|------------|-------|------|-----|
| non_fraud | non_fraud | 69259 | 97.3 |
| fraud | non_fraud | 1816 | 2.6 |
| fraud | fraud | 115 | 0.2 |
| non_fraud | fraud | 12 | 0.0 |
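
From the same test-set predictions we can pull class-level metrics directly; for example, sensitivity and specificity for the fraud class (again with event_level = "second", since fraud is the second factor level). This is a sketch of an extra check, not part of the original analysis:

test_preds <- final_fit %>% collect_predictions()
sens(test_preds, Class, .pred_class, event_level = "second")  # share of frauds caught
spec(test_preds, Class, .pred_class, event_level = "second")  # share of non-frauds kept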