Intro

This is a demonstration of how to use Amazon SageMaker via R Studio rather than a Jupyter notebook. This demo closely follows the post from Ryan Garner: https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/. Special thanks to him for putting together the original tutorial and virtual stack on which this demo was run.

Goals:

  1. Correctly classify fraudulent credit card transactions using Amazon SageMaker
  2. Use the XGBoost machine learning algorithm on a highly imbalanced dataset
  3. Recreate the original tutorial with a different input dataset

This notebook is not a full end-to-end analysis, and as such it lacks many key steps such as robust exploratory data analysis, variable transformations, feature extraction, feature importance analysis and data visualization. It is intended to explore how well AWS SageMaker and XGBoost perform on a highly imbalanced fraud dataset. To recreate this analysis, an AWS account is required (the free tier is sufficient). Please refer to the original tutorial linked above for the steps required to launch an AWS CloudFormation stack that runs R Studio in the cloud on an AWS EC2 (Elastic Compute Cloud) instance with AWS SageMaker.

XGBoost Reference:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
https://xgboost.readthedocs.io/en/latest/

The Data

The data was taken from Kaggle, specifically the ‘Credit Card Fraud Detection’ dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud/home. The data comes preprocessed, with all variables numeric. All but two of the variables are the result of PCA (Principal Component Analysis), a dimensionality reduction method that condenses a large set of correlated variables into a smaller number of uncorrelated components. See: https://en.wikipedia.org/wiki/Principal_component_analysis
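
For readers unfamiliar with PCA, here is a minimal, illustrative sketch using base R’s prcomp() on a built-in dataset. It is not part of this demo’s pipeline, since the Kaggle data arrives already transformed:

# minimal PCA illustration (not part of this demo's pipeline)
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # scores on the first two principal components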

Environment Set Up

Load the reticulate package (which provides an R interface to Python) and import the ‘sagemaker’ Python module.

The Session class provides operations for working with boto3 resources within Amazon SageMaker (S3, training jobs, endpoints, and so on):

library(reticulate)
library(dplyr)

sagemaker <- import('sagemaker')
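
If the import fails, an optional check of which Python environment reticulate is bound to can help with troubleshooting:

# optional sanity check: show which Python environment reticulate is using
reticulate::py_config()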

Create storage in S3.

This creates a default bucket to store the data. The session log confirms it: INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-. The bucket name normally ends with the AWS account number, which was withheld here.

session <- sagemaker$Session()
bucket <- session$default_bucket()

Specify the IAM role’s ARN to allow Amazon SageMaker to access the Amazon S3 bucket:

role_arn <- session$expand_role('sagemaker-service-role')

Load In Data

For this demo, which runs in an R Markdown notebook hosted on a temporary EC2 instance, the file was simply uploaded to R Studio from my local drive. Readers can download the data from Kaggle and upload it on their own if desired.

It is worth noting that the data is quite clean, with no missing values. The target response (dependent) variable “Class” is the last column; its mean of 0.001728 shows that it is heavily imbalanced.

fraud <- read.csv(file = "creditcard.csv", header = TRUE)
fraud <- data.frame(sapply(fraud, as.numeric))
summary(fraud)
##       Time              V1                  V2               V3           
##  Min.   :     0   Min.   :-56.41000   Min.   :-72.72   Min.   :-48.33000  
##  1st Qu.: 54202   1st Qu.: -0.92000   1st Qu.: -0.60   1st Qu.: -0.89000  
##  Median : 84692   Median :  0.02000   Median :  0.07   Median :  0.18000  
##  Mean   : 94814   Mean   :  0.00001   Mean   :  0.00   Mean   : -0.00001  
##  3rd Qu.:139320   3rd Qu.:  1.32000   3rd Qu.:  0.80   3rd Qu.:  1.03000  
##  Max.   :172792   Max.   :  2.45000   Max.   : 22.06   Max.   :  9.38000  
##        V4                  V5                V6               V7        
##  Min.   :-5.680000   Min.   :-113.74   Min.   :-26.16   Min.   :-43.56  
##  1st Qu.:-0.850000   1st Qu.:  -0.69   1st Qu.: -0.77   1st Qu.: -0.55  
##  Median :-0.020000   Median :  -0.05   Median : -0.27   Median :  0.04  
##  Mean   :-0.000002   Mean   :   0.00   Mean   :  0.00   Mean   :  0.00  
##  3rd Qu.: 0.740000   3rd Qu.:   0.61   3rd Qu.:  0.40   3rd Qu.:  0.57  
##  Max.   :16.880000   Max.   :  34.80   Max.   : 73.30   Max.   :120.59  
##        V8                  V9                  V10            
##  Min.   :-73.22000   Min.   :-13.430000   Min.   :-24.590000  
##  1st Qu.: -0.21000   1st Qu.: -0.640000   1st Qu.: -0.540000  
##  Median :  0.02000   Median : -0.050000   Median : -0.090000  
##  Mean   : -0.00001   Mean   :  0.000003   Mean   : -0.000003  
##  3rd Qu.:  0.33000   3rd Qu.:  0.600000   3rd Qu.:  0.450000  
##  Max.   : 20.01000   Max.   : 15.590000   Max.   : 23.750000  
##       V11                 V12                  V13           
##  Min.   :-4.800000   Min.   :-18.680000   Min.   :-5.790000  
##  1st Qu.:-0.760000   1st Qu.: -0.410000   1st Qu.:-0.650000  
##  Median :-0.030000   Median :  0.140000   Median :-0.010000  
##  Mean   :-0.000007   Mean   : -0.000009   Mean   : 0.000008  
##  3rd Qu.: 0.740000   3rd Qu.:  0.620000   3rd Qu.: 0.660000  
##  Max.   :12.020000   Max.   :  7.850000   Max.   : 7.130000  
##       V14                  V15                 V16            
##  Min.   :-19.210000   Min.   :-4.500000   Min.   :-14.130000  
##  1st Qu.: -0.430000   1st Qu.:-0.580000   1st Qu.: -0.470000  
##  Median :  0.050000   Median : 0.050000   Median :  0.070000  
##  Mean   :  0.000004   Mean   : 0.000002   Mean   : -0.000006  
##  3rd Qu.:  0.490000   3rd Qu.: 0.650000   3rd Qu.:  0.520000  
##  Max.   : 10.530000   Max.   : 8.880000   Max.   : 17.320000  
##       V17                 V18                 V19           
##  Min.   :-25.16000   Min.   :-9.500000   Min.   :-7.210000  
##  1st Qu.: -0.48000   1st Qu.:-0.500000   1st Qu.:-0.460000  
##  Median : -0.07000   Median : 0.000000   Median : 0.000000  
##  Mean   :  0.00001   Mean   :-0.000006   Mean   :-0.000009  
##  3rd Qu.:  0.40000   3rd Qu.: 0.500000   3rd Qu.: 0.460000  
##  Max.   :  9.25000   Max.   : 5.040000   Max.   : 5.590000  
##       V20                 V21              V22            
##  Min.   :-54.50000   Min.   :-34.83   Min.   :-10.930000  
##  1st Qu.: -0.21000   1st Qu.: -0.23   1st Qu.: -0.540000  
##  Median : -0.06000   Median : -0.03   Median :  0.010000  
##  Mean   :  0.00001   Mean   :  0.00   Mean   :  0.000005  
##  3rd Qu.:  0.13000   3rd Qu.:  0.19   3rd Qu.:  0.530000  
##  Max.   : 39.42000   Max.   : 27.20   Max.   : 10.500000  
##       V23              V24                 V25           
##  Min.   :-44.81   Min.   :-2.840000   Min.   :-1.03e+01  
##  1st Qu.: -0.16   1st Qu.:-0.350000   1st Qu.:-3.20e-01  
##  Median : -0.01   Median : 0.040000   Median : 2.00e-02  
##  Mean   :  0.00   Mean   : 0.000004   Mean   :-2.00e-06  
##  3rd Qu.:  0.15   3rd Qu.: 0.440000   3rd Qu.: 3.50e-01  
##  Max.   : 22.53   Max.   : 4.580000   Max.   : 7.52e+00  
##       V26                 V27                  V28        
##  Min.   :-2.600000   Min.   :-22.570000   Min.   :-15.43  
##  1st Qu.:-0.330000   1st Qu.: -0.070000   1st Qu.: -0.05  
##  Median :-0.050000   Median :  0.000000   Median :  0.01  
##  Mean   : 0.000002   Mean   : -0.000001   Mean   :  0.00  
##  3rd Qu.: 0.240000   3rd Qu.:  0.090000   3rd Qu.:  0.08  
##  Max.   : 3.520000   Max.   : 31.610000   Max.   : 33.85  
##      Amount             Class         
##  Min.   :    0.00   Min.   :0.000000  
##  1st Qu.:    5.60   1st Qu.:0.000000  
##  Median :   22.00   Median :0.000000  
##  Mean   :   88.35   Mean   :0.001728  
##  3rd Qu.:   77.17   3rd Qu.:0.000000  
##  Max.   :25691.16   Max.   :1.000000
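
These claims are easy to verify directly from the loaded dataframe:

# verify the data quality claims above
sum(is.na(fraud))     # total missing values; expected to be 0
table(fraud$Class)    # counts of non-fraud (0) vs. fraud (1)
mean(fraud$Class)     # fraud rate, ~0.001728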

Here is a quick scatterplot of the “Amount” variable by the “Class” variable. It indicates that none of the fraudulent cases involved an exceptionally large monetary loss.

library(ggplot2)
ggplot(fraud,
       aes(x = Class,
           y = Amount,
           color = as.factor(Class))) +
  geom_jitter(width = 0.2)  # jittered points; a separate geom_point() would draw each point twice

Transform the data set as needed. The raw input data needs little transformation apart from moving the target variable to the first column of the dataframe, since SageMaker’s built-in XGBoost expects CSV input with the label in the first column and no header row.

fraud <- fraud %>% dplyr::select(Class, Time:Amount)

Next, we need to split the data into train, test and validation sets. Before doing so, I prefer to shuffle the rows. This is done by assigning a random number (sampled without replacement) to each row of the dataframe and then sorting by it.

# create a column of random row numbers (sampled without replacement)
set.seed(1000)
seq <- sample(1:nrow(fraud), nrow(fraud), replace = FALSE)
seq <- as.data.frame(seq)

# append seq to the data frame so the rows can be shuffled
fraud <- cbind(fraud, seq)

# shuffle the rows by sorting on seq; view to confirm randomization
fraud <- fraud %>% arrange(desc(seq))
fraud <- as_tibble(fraud)

# remove the standalone seq object and drop the seq column
remove(seq)
fraud <- fraud[, 1:31]

# view data
head(fraud[, 1:6])

Now subset the data into train, test, and validation sets. I would have used createDataPartition() from the ‘caret’ package; however, there were issues loading caret (and other packages) onto this specific EC2 instance.

# create training set of 70% of the observations
train <- fraud %>% sample_frac(0.7)

# take the remaining data and break it into 50% test, 50% validation
interim <- anti_join(fraud, train)

test <- interim %>% sample_frac(0.5)
valid <- anti_join(interim, test)
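
If caret and the join-based approach both prove troublesome, here is a hedged base-R sketch of an equivalent split by shuffled row index. One caveat with the approach above is that anti_join() matches on all columns, so any fully duplicated rows could land in the wrong set; index-based splitting avoids that. This sketch assumes the shuffled fraud tibble from above and would replace the three assignments just shown:

# hypothetical base-R alternative: split 70/15/15 by shuffled row index
n   <- nrow(fraud)
idx <- sample(seq_len(n))
train <- fraud[idx[1:floor(0.70 * n)], ]
test  <- fraud[idx[(floor(0.70 * n) + 1):floor(0.85 * n)], ]
valid <- fraud[idx[(floor(0.85 * n) + 1):n], ]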

Write the train and validation sets to .CSV. Note: use the ‘write_csv’ function from the ‘readr’ package rather than the more commonly used ‘write.csv’ or ‘write.table’ from base R. I encountered encoding issues with write.csv() and write.table() that caused some variables to be converted from numeric to integer, which in turn caused an error when training the algorithm later on.

# write data to working directory in preparation for S3 upload
library(readr)
write_csv(train, 'train.csv', col_names = F)
write_csv(valid, 'valid.csv', col_names = F)
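
As an optional sanity check, the first line of the written file should begin with the Class label and contain no header row, matching what SageMaker’s XGBoost expects:

# the first line should start with the Class label (0 or 1), with no header
readLines('train.csv', n = 1)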

Upload the train and validation datasets to the Amazon S3 bucket.

# upload training data
s3_train <- session$upload_data(path = 'train.csv',
                                bucket = bucket,
                                key_prefix = 'data')
# upload validation data
s3_valid <- session$upload_data(path = 'valid.csv',
                                bucket = bucket,
                                key_prefix = 'data')

Define the S3 input types for Amazon SageMaker. This step essentially tells SageMaker that we are sending it .CSV files.

s3_train_input <- sagemaker$s3_input(s3_data = s3_train,
                                     content_type = 'csv')

s3_valid_input <- sagemaker$s3_input(s3_data = s3_valid,
                                     content_type = 'csv')

Model Training

Set up the training Docker containers hosted in Amazon Elastic Container Registry (ECR). The XGBoost image URI differs by region, so the one matching the current session’s region is selected.

containers <- list('us-west-2' = '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
  'us-east-1' = '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
  'us-east-2' = '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
  'eu-west-1' = '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest')
container <- containers[session$boto_region_name][[1]]

# this is the container we will use
container
## [1] "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest"

Define an Amazon SageMaker Estimator. An ‘Estimator’ is the SageMaker SDK’s high-level interface for configuring and running a training job; once fit, it produces a trained model that can be deployed. See: https://sagemaker.readthedocs.io/en/stable/estimators.html

# create Estimator
s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_name = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.m5.large',
                                           train_volume_size = 30L,
                                           train_max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = NULL)

Set XGBoost Parameters

Specify parameters and train the XGBoost model. These parameters differ from the published demo, most notably the objective. Since we are using a different input dataset than the original demo, and trying to predict 1 for ‘fraudulent’ or 0 for ‘not fraudulent’, the default ‘reg:linear’ objective would be inappropriate: “binary:logistic” should be specified instead.

As this is something of an ‘out-of-the-box’ exercise, I did not spend time tuning hyperparameters as one would ideally do during the full modeling process. Potential parameter specifications are indicated and commented out.

# xgboost hyperparameter set up
estimator$set_hyperparameters(num_round = 100L,
                              eta = 0.1,
                              # max_depth = 30L,
                              # max_delta_step = 5L,
                              # nround = 30L,
                              # subsample = 0.75,
                              # colsample_bytree = 0.75,
                              # seed = 1L,
                              # eval_metric = "merror",
                              # num_class = 2L,
                              objective = "binary:logistic")
                              
# job
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')

# create input object
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)
# fit
estimator$fit(inputs = input_data,
              job_name = job_name)
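
Had hyperparameter tuning been in scope, one standard XGBoost lever for imbalanced data is scale_pos_weight, which up-weights the positive (fraud) class during training. A hypothetical re-specification, not run in this demo, might look like the sketch below; it assumes the same estimator and would require re-running fit():

# hypothetical: weight the positive class by the negative/positive ratio (not run here)
neg <- sum(train$Class == 0)   # count of non-fraud rows in the training set
pos <- sum(train$Class == 1)   # count of fraud rows in the training set
estimator$set_hyperparameters(num_round = 100L,
                              eta = 0.1,
                              scale_pos_weight = neg / pos,
                              objective = "binary:logistic")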

This is the trained model object, viewed in its S3 location. Output withheld as it contains specific AWS account info.

estimator$model_data

Deploying the Model

Now that the model is trained, it will be deployed to an endpoint for prediction. This may take up to ~5 minutes. Once deployed, the endpoint can accept test data for prediction purposes, and if the model were sufficiently accurate, it could be promoted to a production environment.

# create endpoint
model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')

Tell the endpoint what type of content to expect.

model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer

Test Predictions

Since the goal is to predict/classify a fraudulent credit card transaction, the next step is to send some test data to our deployed model. There is a limit on how much data can be sent to the endpoint in a single prediction request, so a random sample of 500 records will be taken from the test set.

In the original raw dataset, the occurrences of fraud represented 492 out of 284,807 records, a fraud probability of 0.001727486, or roughly a 1 in 580 chance. As such, the data would be considered highly imbalanced with regard to the target variable “Class”, and occurrences of fraud should be treated as anomalies.

When testing the model, a random sample of 500 records were pulled from the test set. Below, we see that there was 1 case of fraud found in this random sample, as one would expect most of the time.

# set seed for reproducibility
set.seed(1000)

# take a random sample
fraud_sample <- test[sample(nrow(test), 500),]
table(fraud_sample$Class)
## 
##   0   1 
## 499   1
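
For context, the expected number of fraud cases in a random 500-row sample follows directly from the fraud rate computed earlier:

# expected fraud count in a random sample of 500
500 * mean(fraud$Class)   # ~0.86, so seeing 0 or 1 case is typical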

Before testing, the sample must be transformed to remove the target variable “Class”, and converted to a numeric matrix with no column names.

# save a copy of the sample (with actual classes) for later comparison
actuals <- fraud_sample

# convert to matrix and remove colnames
fraud_sample <- as.matrix(fraud_sample[2:ncol(fraud_sample)])
dimnames(fraud_sample)[[2]] <- NULL

# view test data matrix
head(fraud_sample)[,1:6]
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] 163980  1.93  0.57 -0.93  3.45  0.94
## [2,]  30528 -1.07 -0.47  1.68 -1.00 -0.90
## [3,]   1704 -4.65  3.22 -1.50 -2.47 -1.31
## [4,] 124794 -0.26  0.99  0.82  0.06  1.08
## [5,]  96000  0.02  0.24 -1.03 -0.53  2.13
## [6,] 167225 -3.70  3.88 -3.88 -1.07 -1.51

Now the sample of 500 transactions, containing the one fraudulent transaction, is sent to the model endpoint via the predict() function. The predictions returned by the deployed model come back as a single comma-separated string, where each value is the predicted probability that the transaction is fraudulent (Class = 1).

This prediction string needs to be parsed into a character vector using the stringr package and then converted to numeric. Note: scientific notation is disabled via options() so that small probabilities print in full.

library(stringr)

#disable scientific notation
options(scipen = 999)

# generate predictions
predictions <- model_endpoint$predict(fraud_sample)

# parse out and convert to numeric dataframe
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)
predictions <- data.frame(predictions)

# take a look at the predictions from largest to smallest
head(data.frame(fraud_prob = predictions[order(-predictions),]))

As indicated above, it appears the XGBoost model has isolated a single observation as having a high probability of being fraudulent. To find out if the model did find the right fraudulent transaction, we need to compare to the actual data.

Below, we column-bind the predictions to the original fraud sample of 500, and filter down to the 1 fraudulent transaction. In doing so, we see that the deployed model did successfully find the fraudulent transaction.

actuals <- cbind(predictions, actuals)
actuals[1:8] %>% dplyr::filter(Class == 1)

Confusion Matrix

This is a quick-n-dirty confusion matrix, created by first establishing a decision threshold (probability cut-point) of 0.9. Any transaction with a predicted probability exceeding 90% will be considered fraudulent and flagged. The predictions are cross-tabbed against the actuals, for an accuracy of 100% on this sample. All non-fraud transactions were correctly classified as ‘0’, and the 1 fraudulent transaction was correctly classified as a ‘1’.

Again, while this seems encouraging, we are only testing on a 500-transaction sample. A more robust testing procedure should be completed.

# apply the decision threshold for binary classification
actuals <- actuals %>% 
              dplyr::mutate(pred_class = ifelse(predictions > 0.9, 1, 0))

# create confusion matrix
table(actuals$pred_class, actuals$Class)
##    
##       0   1
##   0 499   0
##   1   0   1
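
Because the classes are this imbalanced, accuracy alone is a weak metric: a model that predicts ‘not fraud’ for every transaction would still score 99.8% on this sample. Precision and recall for the fraud class are more informative; a quick computation from the columns created above:

# precision and recall for the fraud class (Class = 1)
tp <- sum(actuals$pred_class == 1 & actuals$Class == 1)
fp <- sum(actuals$pred_class == 1 & actuals$Class == 0)
fn <- sum(actuals$pred_class == 0 & actuals$Class == 1)
c(precision = tp / (tp + fp), recall = tp / (tp + fn))

Both equal 1 on this particular sample; on the full test set, with more fraud cases present, they would be far more telling.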

Conclusion

The results of the demo using XGBoost on the Kaggle Credit Card Fraud data were promising. Admittedly, the heavy lifting had already been done in terms of data collection, cleaning and preprocessing. The success of the exercise can largely be attributed to the data quality and the strength of the XGBoost algorithm. While only a sample of the actual test data was run, achieving 100% accuracy on it was a small win. The approach outlined in this notebook could serve as a stepping stone to a fully expanded fraud detection production environment. Thanks for reading!