This is a demonstration of how to use Amazon SageMaker via RStudio rather than a Jupyter notebook. This demo closely follows the post by Ryan Garner: https://aws.amazon.com/blogs/machine-learning/using-r-with-amazon-sagemaker/. Special thanks to him for putting together the original tutorial and the virtual stack on which this demo was run.
Goals:
This notebook is not a full end-to-end analysis, and as such it lacks many key steps such as robust exploratory data analysis, variable transformations, feature extraction, feature importance analysis, and data visualization. The notebook is intended to explore how well AWS SageMaker and XGBoost perform on a highly imbalanced fraud dataset. To recreate this analysis, an AWS account is required (the free tier is sufficient). Please refer to the original tutorial linked above for the steps required to launch an AWS CloudFormation stack that runs RStudio in the cloud on an AWS EC2 (Elastic Compute Cloud) instance with access to AWS SageMaker.
XGBoost Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html https://xgboost.readthedocs.io/en/latest/
The data comes from the Kaggle 'Credit Card Fraud Detection' competition: https://www.kaggle.com/mlg-ulb/creditcardfraud/home. The data has already been preprocessed, and all variables are numeric. All but two of the variables are the result of PCA (Principal Component Analysis), a dimensionality-reduction method that seeks to condense a large set of variables into a smaller number of higher-level numerical components. See: https://en.wikipedia.org/wiki/Principal_component_analysis
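For readers unfamiliar with PCA, here is a minimal, purely illustrative sketch using base R's prcomp() on the built-in mtcars data. It is not the preprocessing applied to the fraud data (that was done by the dataset's publishers), but it shows the general idea of projecting correlated numeric columns onto a smaller set of components.
# illustrative only: PCA on built-in data, not the actual fraud preprocessing
pca_demo <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], center = TRUE, scale. = TRUE)
summary(pca_demo)    # proportion of variance explained by each principal component
head(pca_demo$x, 3)  # component scores, analogous in spirit to V1-V28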
Load the reticulate package (which provides an R interface to Python) and import the 'sagemaker' Python module.
The Session class provides operations for working with Amazon SageMaker and its underlying boto3 resources, such as the S3 bucket, training jobs, and endpoints used below.
library(reticulate)
library(dplyr)
sagemaker <- import('sagemaker')
Creating a session and requesting its default bucket sets up an S3 bucket to store the data. The first time this runs, the console prints a line like: INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-
session <- sagemaker$Session()
bucket <- session$default_bucket()
Specify the IAM role’s ARN to allow Amazon SageMaker to access the Amazon S3 bucket:
role_arn <- session$expand_role('sagemaker-service-role')
For this demo, which runs in an R Markdown notebook hosted on a temporary EC2 instance, the file was simply uploaded to RStudio from my local drive. Readers can download the data from Kaggle and upload it themselves if desired.
It is worth noting that the data is quite clean, with no missing values. The target (dependent) variable "Class" is the last column, with a mean of 0.001728, demonstrating that it is heavily imbalanced.
fraud <- read.csv(file = "creditcard.csv", header = TRUE)
fraud <- data.frame(sapply(fraud, as.numeric))
summary(fraud)
## Time V1 V2 V3
## Min. : 0 Min. :-56.41000 Min. :-72.72 Min. :-48.33000
## 1st Qu.: 54202 1st Qu.: -0.92000 1st Qu.: -0.60 1st Qu.: -0.89000
## Median : 84692 Median : 0.02000 Median : 0.07 Median : 0.18000
## Mean : 94814 Mean : 0.00001 Mean : 0.00 Mean : -0.00001
## 3rd Qu.:139320 3rd Qu.: 1.32000 3rd Qu.: 0.80 3rd Qu.: 1.03000
## Max. :172792 Max. : 2.45000 Max. : 22.06 Max. : 9.38000
## V4 V5 V6 V7
## Min. :-5.680000 Min. :-113.74 Min. :-26.16 Min. :-43.56
## 1st Qu.:-0.850000 1st Qu.: -0.69 1st Qu.: -0.77 1st Qu.: -0.55
## Median :-0.020000 Median : -0.05 Median : -0.27 Median : 0.04
## Mean :-0.000002 Mean : 0.00 Mean : 0.00 Mean : 0.00
## 3rd Qu.: 0.740000 3rd Qu.: 0.61 3rd Qu.: 0.40 3rd Qu.: 0.57
## Max. :16.880000 Max. : 34.80 Max. : 73.30 Max. :120.59
## V8 V9 V10
## Min. :-73.22000 Min. :-13.430000 Min. :-24.590000
## 1st Qu.: -0.21000 1st Qu.: -0.640000 1st Qu.: -0.540000
## Median : 0.02000 Median : -0.050000 Median : -0.090000
## Mean : -0.00001 Mean : 0.000003 Mean : -0.000003
## 3rd Qu.: 0.33000 3rd Qu.: 0.600000 3rd Qu.: 0.450000
## Max. : 20.01000 Max. : 15.590000 Max. : 23.750000
## V11 V12 V13
## Min. :-4.800000 Min. :-18.680000 Min. :-5.790000
## 1st Qu.:-0.760000 1st Qu.: -0.410000 1st Qu.:-0.650000
## Median :-0.030000 Median : 0.140000 Median :-0.010000
## Mean :-0.000007 Mean : -0.000009 Mean : 0.000008
## 3rd Qu.: 0.740000 3rd Qu.: 0.620000 3rd Qu.: 0.660000
## Max. :12.020000 Max. : 7.850000 Max. : 7.130000
## V14 V15 V16
## Min. :-19.210000 Min. :-4.500000 Min. :-14.130000
## 1st Qu.: -0.430000 1st Qu.:-0.580000 1st Qu.: -0.470000
## Median : 0.050000 Median : 0.050000 Median : 0.070000
## Mean : 0.000004 Mean : 0.000002 Mean : -0.000006
## 3rd Qu.: 0.490000 3rd Qu.: 0.650000 3rd Qu.: 0.520000
## Max. : 10.530000 Max. : 8.880000 Max. : 17.320000
## V17 V18 V19
## Min. :-25.16000 Min. :-9.500000 Min. :-7.210000
## 1st Qu.: -0.48000 1st Qu.:-0.500000 1st Qu.:-0.460000
## Median : -0.07000 Median : 0.000000 Median : 0.000000
## Mean : 0.00001 Mean :-0.000006 Mean :-0.000009
## 3rd Qu.: 0.40000 3rd Qu.: 0.500000 3rd Qu.: 0.460000
## Max. : 9.25000 Max. : 5.040000 Max. : 5.590000
## V20 V21 V22
## Min. :-54.50000 Min. :-34.83 Min. :-10.930000
## 1st Qu.: -0.21000 1st Qu.: -0.23 1st Qu.: -0.540000
## Median : -0.06000 Median : -0.03 Median : 0.010000
## Mean : 0.00001 Mean : 0.00 Mean : 0.000005
## 3rd Qu.: 0.13000 3rd Qu.: 0.19 3rd Qu.: 0.530000
## Max. : 39.42000 Max. : 27.20 Max. : 10.500000
## V23 V24 V25
## Min. :-44.81 Min. :-2.840000 Min. :-1.03e+01
## 1st Qu.: -0.16 1st Qu.:-0.350000 1st Qu.:-3.20e-01
## Median : -0.01 Median : 0.040000 Median : 2.00e-02
## Mean : 0.00 Mean : 0.000004 Mean :-2.00e-06
## 3rd Qu.: 0.15 3rd Qu.: 0.440000 3rd Qu.: 3.50e-01
## Max. : 22.53 Max. : 4.580000 Max. : 7.52e+00
## V26 V27 V28
## Min. :-2.600000 Min. :-22.570000 Min. :-15.43
## 1st Qu.:-0.330000 1st Qu.: -0.070000 1st Qu.: -0.05
## Median :-0.050000 Median : 0.000000 Median : 0.01
## Mean : 0.000002 Mean : -0.000001 Mean : 0.00
## 3rd Qu.: 0.240000 3rd Qu.: 0.090000 3rd Qu.: 0.08
## Max. : 3.520000 Max. : 31.610000 Max. : 33.85
## Amount Class
## Min. : 0.00 Min. :0.000000
## 1st Qu.: 5.60 1st Qu.:0.000000
## Median : 22.00 Median :0.000000
## Mean : 88.35 Mean :0.001728
## 3rd Qu.: 77.17 3rd Qu.:0.000000
## Max. :25691.16 Max. :1.000000
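As a quick sanity check on the claims above (no missing values, heavily imbalanced target), the following optional lines can be run on the loaded data frame:
# optional check: missing values and class imbalance
sum(is.na(fraud))    # expected to be 0 - no missing data
table(fraud$Class)   # counts of non-fraud (0) vs. fraud (1)
mean(fraud$Class)    # proportion of fraudulent transactions (~0.0017)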
Here is a quick scatterplot of the “Class” variable by the “Amount” variable. This indicates that none of the fraudulent cases constituted an exceptionally large monetary loss.
library(ggplot2)
ggplot(fraud,
aes(x = Class,
y = Amount,
color = as.factor(Class))) +
geom_point() +
geom_jitter()
Transform the data set as needed. The raw input data needs little transformation apart from moving the target variable to the first column of the data frame, which is where SageMaker's built-in XGBoost expects the label to be for CSV input.
fraud <- fraud %>% dplyr::select(Class, Time:Amount)
Next, we need to split the data into train, test, and validation sets. Before doing this, I prefer to shuffle the rows of the data frame. This is done by assigning a random number (sampled without replacement) to each row and then sorting on that number.
# create a random sequence, one number per row, sampled without replacement
set.seed(1000)
seq <- sample(1:nrow(fraud), nrow(fraud), replace = FALSE)
seq <- as.data.frame(seq)
# append seq to the fraud data frame so the rows can be shuffled
fraud <- cbind(fraud, seq)
# shuffle the rows by sorting on seq
fraud <- fraud %>% arrange(desc(seq))
fraud <- as_tibble(fraud)  # tbl_df() is deprecated; as_tibble() is its replacement
# drop the seq column and remove the helper object
remove(seq)
fraud <- fraud[, 1:31]
# view data
head(fraud[,1:6])
Now subset the data into train, test, and validation sets. I would have used createDataPartition() from the 'caret' package, however there were issues loading it (and other packages) onto this specific EC2 instance; a hypothetical sketch of that approach is shown after the split code below.
# create training set of 70% of the observations
train <- fraud %>% sample_frac(0.7)
# take the remaining data and break it into 50% test, 50% validation
interim <- anti_join(fraud, train)
test <- interim %>% sample_frac(0.5)
valid <- anti_join(interim, test)
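For reference, had the 'caret' package loaded successfully, a stratified split along the lines of the hypothetical sketch below could have been used instead; createDataPartition() keeps roughly the same proportion of each class in every partition, which is helpful for rare-event data. The object names (train_alt, test_alt, valid_alt) are placeholders, and this code was not run on the EC2 instance.
# hypothetical alternative using caret (not run here, given the package issues noted above)
library(caret)
set.seed(1000)
# stratify on Class so each partition keeps roughly the same fraud rate
train_idx <- createDataPartition(as.factor(fraud$Class), p = 0.7, list = FALSE)[, 1]
train_alt <- fraud[train_idx, ]
holdout   <- fraud[-train_idx, ]
test_idx  <- createDataPartition(as.factor(holdout$Class), p = 0.5, list = FALSE)[, 1]
test_alt  <- holdout[test_idx, ]
valid_alt <- holdout[-test_idx, ]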
Write the train and validation sets to .CSV. Note: use the 'write_csv' function from the 'readr' package rather than the more commonly used 'write.csv' or 'write.table' from base R. I encountered issues with write.csv() and write.table() due to encoding, which caused some variables to be converted from numeric to integer, which in turn caused an error when training the algorithm later on.
# write data to working directory in preparation for S3 upload
library(readr)
write_csv(train, 'train.csv', col_names = F)
write_csv(valid, 'valid.csv', col_names = F)
Upload the train and validation datasets to the Amazon S3 bucket.
# upload training data
s3_train <- session$upload_data(path = 'train.csv',
bucket = bucket,
key_prefix = 'data')
# upload validation data
s3_valid <- session$upload_data(path = 'valid.csv',
bucket = bucket,
key_prefix = 'data')
Define the S3 input types for Amazon SageMaker. This step is essentially telling SageMaker that we are sending it .CSV files.
s3_train_input <- sagemaker$s3_input(s3_data = s3_train,
content_type = 'csv')
s3_valid_input <- sagemaker$s3_input(s3_data = s3_valid,
content_type = 'csv')
Specify the Amazon SageMaker XGBoost training Docker containers hosted in Amazon Elastic Container Registry (ECR), and select the one matching the session's region.
containers <- list('us-west-2' = '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
'us-east-1' = '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
'us-east-2' = '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
'eu-west-1' = '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest')
container <- containers[session$boto_region_name][[1]]
# this is the container we will use
container
## [1] "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest"
Define an Amazon SageMaker Estimator. An 'Estimator' is the SageMaker term for the object that configures and runs a training job and, once fit, references the resulting trained model artifacts. See: https://sagemaker.readthedocs.io/en/stable/estimators.html
# create Estimator
s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_name = container,
role = role_arn,
train_instance_count = 1L,
train_instance_type = 'ml.m5.large',
train_volume_size = 30L,
train_max_run = 3600L,
input_mode = 'File',
output_path = s3_output,
output_kms_key = NULL,
base_job_name = NULL,
sagemaker_session = NULL)
Specify parameters and train the XGBoost model. These parameters differ from the published demo, most notably the objective. Since we are using a different input data set than the original demo and trying to predict 1 for 'fraudulent' or 0 for 'not fraudulent', the default 'reg:linear' objective would be inappropriate; 'binary:logistic' should be specified instead.
As this is something of an 'out-of-the-box' exercise, I did not spend time tuning hyperparameters as one ideally would during a full modeling process. Potential parameter specifications are indicated but commented out.
# xgboost hyperparameter set up
estimator$set_hyperparameters(num_round = 100L,
eta = 0.1,
# max_depth = 30L,
# max_delta_step = 5L,
# nround = 30L,
# subsample = 0.75,
# colsample_bytree = 0.75,
# seed = 1L,
# eval_metric = "merror"
# num_class = 2L,
objective = "binary:logistic")
# job
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')
# create input object
input_data <- list('train' = s3_train_input,
'validation' = s3_valid_input)
# fit
estimator$fit(inputs = input_data,
job_name = job_name)
This is the trained model object, viewed at its S3 location. Output withheld as it contains specific AWS account info.
estimator$model_data
Now that the model is created, it will be deployed to an endpoint for prediction. This may take up to ~5 minutes. Once deployed, the model will be able to accept test data for prediction purposes, and if sufficiently accurate, it could be deployed in a production environment.
# create endpoint
model_endpoint <- estimator$deploy(initial_instance_count = 1L,
instance_type = 'ml.t2.medium')
Tell the endpoint what type of content to expect.
model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer
Since the goal is to predict/classify a fraudulent credit card transaction, the next step is to send some test data to our deployed model. There is a limit on how many predictions can be requested at once, so a random sample of 500 records will be taken from the test set.
In the original raw dataset, fraud occurred in 492 out of 284,807 records. This represents a fraud probability of 0.001727486, or slightly less than a 1 in 500 chance. As such, the data is highly imbalanced with regard to the target variable "Class", and occurrences of fraud should be considered anomalies.
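That rate can be confirmed with a line of arithmetic, which also shows why a 500-record sample is expected to contain roughly one fraudulent record:
# base rate of fraud and expected fraud count in a 500-record sample
492 / 284807           # ~0.001727486
500 * (492 / 284807)   # ~0.86 fraudulent records expected per 500 sampled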
When testing the model, a random sample of 500 records was pulled from the test set. Below, we see that there was 1 case of fraud in this random sample, which is about what one would expect.
# set seed for reproducibility
set.seed(1000)
# take a random sample
fraud_sample <- test[sample(nrow(test), 500),]
table(fraud_sample$Class)
##
## 0 1
## 499 1
Before testing, the sample must be transformed to remove the target variable “Class”, and converted to a numeric matrix with no column names.
# save a copy of actual classes
actuals <- fraud_sample
# convert to matrix and remove colnames
fraud_sample <- as.matrix(fraud_sample[2:ncol(fraud_sample)])
dimnames(fraud_sample)[[2]] <- NULL
# view test data matrix
head(fraud_sample)[,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 163980 1.93 0.57 -0.93 3.45 0.94
## [2,] 30528 -1.07 -0.47 1.68 -1.00 -0.90
## [3,] 1704 -4.65 3.22 -1.50 -2.47 -1.31
## [4,] 124794 -0.26 0.99 0.82 0.06 1.08
## [5,] 96000 0.02 0.24 -1.03 -0.53 2.13
## [6,] 167225 -3.70 3.88 -3.88 -1.07 -1.51
Now the sample of 500 transactions, containing the one fraudulent transaction, is sent to the model endpoint via the predict() function. The predictions returned by the deployed model come back as a single comma-separated string. Each value is the predicted probability that the transaction is fraudulent (Class = 1).
This prediction string needs to be parsed into a character vector using the stringr package and then converted to a numeric data type. Note: scientific notation is disabled via options() so that small probabilities display in full.
library(stringr)
#disable scientific notation
options(scipen = 999)
# generate predictions
predictions <- model_endpoint$predict(fraud_sample)
# parse out and convert to numeric dataframe
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)
predictions <- data.frame(predictions)
# take a look at the predictions from largest to smallest
head(data.frame(fraud_prob = predictions[order(-predictions),]))
As indicated above, it appears the XGBoost model has isolated a single observation as having a high probability of being fraudulent. To find out whether the model flagged the right transaction, we need to compare against the actual data.
Below, we column-bind the predictions to the original fraud sample of 500, and filter down to the 1 fraudulent transaction. In doing so, we see that the deployed model did successfully find the fraudulent transaction.
actuals <- cbind(predictions, actuals)
actuals[1:8] %>% dplyr::filter(Class == 1)
This is a quick-and-dirty confusion matrix, created by first establishing a decision threshold (probability cut-point) of 0.9: any transaction with a predicted probability exceeding 90% is flagged as fraudulent. The predictions are cross-tabbed against the actuals, giving an accuracy of 100%. All non-fraud transactions were correctly classified as '0', and the 1 fraudulent transaction was correctly classified as '1'.
Again, while this seems encouraging, we are only testing on a 500-transaction sample. A more robust testing procedure should be completed; one possible next step is sketched after the confusion matrix below.
# use decision threshold for binary classification
actuals <- actuals %>%
dplyr::mutate(pred_class = ifelse(predictions > 0.9,
as.numeric(1),
as.numeric(0)))
# create confusion matrix
table(actuals$pred_class, actuals$Class)
##
## 0 1
## 0 499 0
## 1 0 1
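As one possible step toward that more robust testing, precision and recall, which are far more informative than raw accuracy on imbalanced data, could be computed from the same objects; the sketch below assumes the 'actuals' data frame built above.
# sketch: precision and recall from the predictions above (assumes 'actuals' as built earlier)
tp <- sum(actuals$pred_class == 1 & actuals$Class == 1)   # true positives
fp <- sum(actuals$pred_class == 1 & actuals$Class == 0)   # false positives
fn <- sum(actuals$pred_class == 0 & actuals$Class == 1)   # false negatives
precision <- tp / (tp + fp)   # of the flagged transactions, how many were truly fraud
recall    <- tp / (tp + fn)   # of the true frauds, how many were flagged
c(precision = precision, recall = recall)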
The results of this demo using XGBoost on the Kaggle Credit Card Fraud data were promising. Admittedly, the heavy lifting had already been done in terms of data collection, cleaning, and preprocessing. The success of the exercise can largely be attributed to the data quality and the strength of the XGBoost algorithm. While only a sample of the actual test data was run, achieving 100% accuracy was a small win. The approach outlined in this notebook could serve as a stepping stone to a fully expanded fraud detection production environment. Thanks for reading!