Description

Today, we are going to build an XGBoost model to detect credit card fraud. Our data contains transactions made by European cardholders with credit cards in September 2013. For confidentiality, the features were transformed with PCA; only the time and amount of each transaction are retained in their original form. The Class column indicates the fraud label: 0 means the transaction is not flagged as fraud and 1 means it is.

For more information about the dataset, visit: https://www.kaggle.com/mlg-ulb/creditcardfraud. The following is a brief summary from that page.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. 
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ dplyr   1.0.6
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──
## ✓ broom        0.7.6      ✓ rsample      0.1.0 
## ✓ dials        0.0.9      ✓ tune         0.1.5 
## ✓ infer        0.5.4      ✓ workflows    0.2.2 
## ✓ modeldata    0.1.0      ✓ workflowsets 0.0.2 
## ✓ parsnip      0.1.5      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.16
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
creditcard <- read_csv("creditcard.csv") %>%
  mutate(Class = factor(
    Class,
    levels = c(0, 1),
    labels = c("non_fraud", "fraud")
  ))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
skimr::skim(creditcard)
Data summary

|                        |            |
|------------------------|------------|
| Name                   | creditcard |
| Number of rows         | 284807     |
| Number of columns      | 31         |
| Column type frequency: |            |
| factor                 | 1          |
| numeric                | 30         |
| Group variables        | None       |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---------------|-----------|---------------|---------|----------|------------|
| Class | 0 | 1 | FALSE | 2 | non: 284315, fra: 492 |

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time 0 1 94813.86 47488.15 0.00 54201.50 84692.00 139320.50 172792.00 ▃▇▅▆▇
V1 0 1 0.00 1.96 -56.41 -0.92 0.02 1.32 2.45 ▁▁▁▁▇
V2 0 1 0.00 1.65 -72.72 -0.60 0.07 0.80 22.06 ▁▁▁▇▁
V3 0 1 0.00 1.52 -48.33 -0.89 0.18 1.03 9.38 ▁▁▁▁▇
V4 0 1 0.00 1.42 -5.68 -0.85 -0.02 0.74 16.88 ▂▇▁▁▁
V5 0 1 0.00 1.38 -113.74 -0.69 -0.05 0.61 34.80 ▁▁▁▇▁
V6 0 1 0.00 1.33 -26.16 -0.77 -0.27 0.40 73.30 ▁▇▁▁▁
V7 0 1 0.00 1.24 -43.56 -0.55 0.04 0.57 120.59 ▁▇▁▁▁
V8 0 1 0.00 1.19 -73.22 -0.21 0.02 0.33 20.01 ▁▁▁▇▁
V9 0 1 0.00 1.10 -13.43 -0.64 -0.05 0.60 15.59 ▁▁▇▁▁
V10 0 1 0.00 1.09 -24.59 -0.54 -0.09 0.45 23.75 ▁▁▇▁▁
V11 0 1 0.00 1.02 -4.80 -0.76 -0.03 0.74 12.02 ▁▇▁▁▁
V12 0 1 0.00 1.00 -18.68 -0.41 0.14 0.62 7.85 ▁▁▁▇▁
V13 0 1 0.00 1.00 -5.79 -0.65 -0.01 0.66 7.13 ▁▃▇▁▁
V14 0 1 0.00 0.96 -19.21 -0.43 0.05 0.49 10.53 ▁▁▁▇▁
V15 0 1 0.00 0.92 -4.50 -0.58 0.05 0.65 8.88 ▁▇▂▁▁
V16 0 1 0.00 0.88 -14.13 -0.47 0.07 0.52 17.32 ▁▁▇▁▁
V17 0 1 0.00 0.85 -25.16 -0.48 -0.07 0.40 9.25 ▁▁▁▇▁
V18 0 1 0.00 0.84 -9.50 -0.50 0.00 0.50 5.04 ▁▁▂▇▁
V19 0 1 0.00 0.81 -7.21 -0.46 0.00 0.46 5.59 ▁▁▇▂▁
V20 0 1 0.00 0.77 -54.50 -0.21 -0.06 0.13 39.42 ▁▁▇▁▁
V21 0 1 0.00 0.73 -34.83 -0.23 -0.03 0.19 27.20 ▁▁▇▁▁
V22 0 1 0.00 0.73 -10.93 -0.54 0.01 0.53 10.50 ▁▁▇▁▁
V23 0 1 0.00 0.62 -44.81 -0.16 -0.01 0.15 22.53 ▁▁▁▇▁
V24 0 1 0.00 0.61 -2.84 -0.35 0.04 0.44 4.58 ▁▇▆▁▁
V25 0 1 0.00 0.52 -10.30 -0.32 0.02 0.35 7.52 ▁▁▇▂▁
V26 0 1 0.00 0.48 -2.60 -0.33 -0.05 0.24 3.52 ▁▆▇▁▁
V27 0 1 0.00 0.40 -22.57 -0.07 0.00 0.09 31.61 ▁▁▇▁▁
V28 0 1 0.00 0.33 -15.43 -0.05 0.01 0.08 33.85 ▁▇▁▁▁
Amount 0 1 88.35 250.12 0.00 5.60 22.00 77.16 25691.16 ▇▁▁▁▁
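
One thing the summary makes obvious is how skewed Amount is (median 22, maximum 25,691.16). As an optional exploratory step that is not part of the modeling pipeline, we could compare the amount distributions of the two classes on a log-like scale:

creditcard %>%
  ggplot(aes(Amount)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(trans = "log1p") +   # log1p keeps the zero-amount transactions
  facet_wrap(~ Class, scales = "free_y")  # free y-axis because the classes are so unbalanced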

Build our XGBoost model

Split the data into training and testing sets and create bootstrap resamples

From the creditcard data frame, we split off a testing set and a training set, and then create resamples of the training data using the bootstrap method. We stratify by Class so that the rare fraud cases appear in the same proportion in every split.

set.seed(123)
credit_split <- initial_split(creditcard, strata = Class)
credit_test <- testing(credit_split)
credit_train <- training(credit_split)

set.seed(234)
bs_resample <- bootstraps(credit_train, strata = Class)
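
As a quick sanity check (an addition, not part of the original pipeline), we can verify that stratification kept the fraud rate roughly the same in both sets:

credit_train %>% count(Class) %>% mutate(prop = n / sum(n))
credit_test  %>% count(Class) %>% mutate(prop = n / sum(n))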

Build the preprocessing recipe and model

In the preprocessing step, we downsample the training data by Class, since the classes are highly unbalanced: only 0.172% of all transactions are fraud. Without rebalancing, a model can reach very high accuracy simply by predicting non_fraud for every transaction while missing almost all of the fraud, so we balance the classes before fitting. We also remove any zero-variance predictors.

xgboost_recipe <- 
  recipe(formula = Class ~ ., data = credit_train) %>% 
  themis::step_downsample(Class) %>% 
  step_zv(all_predictors())
## Registered S3 methods overwritten by 'themis':
##   method                  from   
##   bake.step_downsample    recipes
##   bake.step_upsample      recipes
##   prep.step_downsample    recipes
##   prep.step_upsample      recipes
##   tidy.step_downsample    recipes
##   tidy.step_upsample      recipes
##   tunable.step_downsample recipes
##   tunable.step_upsample   recipes
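To see what the downsampling step actually does, we can prep() the recipe and bake() it; this check is illustrative and not needed for the workflow below. Both classes should come out with the same count, equal to the number of fraud cases in the training set:

xgboost_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%  # new_data = NULL returns the processed training set
  count(Class)
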
xgboost_spec <- 
  boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 
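
We accept boost_tree()'s default hyperparameters here. To inspect exactly what parsnip will pass to the xgboost engine, you can print the translated call:

xgboost_spec %>% translate()

Parameters such as trees, tree_depth, and learn_rate could later be marked with tune() and optimized on the bootstrap resamples, but we keep the defaults in this post.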

Fit the model

This step fits the model on the bootstrap resamples, saving the predictions so we can inspect them later.

set.seed(8000)
fit_resam <- xgboost_workflow %>% 
  fit_resamples(
    resamples = bs_resample,
    metrics = metric_set(roc_auc, accuracy),
    control = control_resamples(save_pred = TRUE)
  )

Let’s see how the model performs across the resamples.

fit_resam %>% collect_metrics() %>% knitr::kable(digits = 3)
| .metric | .estimator | mean | n | std_err | .config |
|---------|------------|------|---|---------|---------|
| accuracy | binary | 0.975 | 25 | 0.001 | Preprocessor1_Model1 |
| roc_auc | binary | 0.975 | 25 | 0.001 | Preprocessor1_Model1 |
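
Because we saved the predictions, we can also draw one ROC curve per bootstrap resample; this is an optional extra. Note that yardstick treats the first factor level (non_fraud here) as the event by default, so we pass event_level = "second" to score the fraud class:

fit_resam %>%
  collect_predictions() %>%
  group_by(id) %>%                                          # one curve per resample
  roc_curve(Class, .pred_fraud, event_level = "second") %>%
  autoplot()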

Now we finally fit the model on the testing data, which was never used during training. Reading the confusion matrix below, about 97.4% of all transactions are classified correctly, and 115 of the 127 actual frauds are caught (roughly 90.6%), which is relatively high given how rare fraud is. The trade-off of downsampling is also visible: 1,816 legitimate transactions are flagged as fraud.

final_fit <- last_fit(xgboost_workflow, credit_split)
final_fit %>% 
  conf_mat_resampled() %>% 
  arrange(Truth, desc(Freq)) %>% 
  mutate(per = 100 * Freq / sum(Freq)) %>% 
  knitr::kable(digits = 1)
| Prediction | Truth | Freq | per |
|------------|-------|------|-----|
| non_fraud | non_fraud | 69259 | 97.3 |
| fraud | non_fraud | 1816 | 2.6 |
| fraud | fraud | 115 | 0.2 |
| non_fraud | fraud | 12 | 0.0 |
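
From the same test-set predictions we can pull class-level metrics directly; for example, sensitivity and specificity for the fraud class (again with event_level = "second", since fraud is the second factor level). This is a sketch of an extra check, not part of the original analysis:

test_preds <- final_fit %>% collect_predictions()
sens(test_preds, Class, .pred_class, event_level = "second")  # share of frauds caught
spec(test_preds, Class, .pred_class, event_level = "second")  # share of non-frauds kept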