Machine learning algorithms are extremely valuable for detecting fraudulent activity. Two commonly used tools are decision tree models and support vector machines (SVMs).
The following dataset, taken from https://www.kaggle.com/datasets/faizaniftikharjanjua/metaverse-financial-transactions-dataset?resource=download, contains over 78,000 rows of blockchain financial transactions. It includes the time each transaction took place, its value, the sender's and recipient's blockchain addresses, and user behavior attributes.
In this analysis I will build a decision tree model and a support vector machine model and compare their performance in classifying transactions as high, moderate, or low risk.
Decision Trees: A decision tree is a supervised learning algorithm that recursively splits a dataset into a tree-like structure to determine a classification. It starts at a root node containing the most predictive feature and branches into nodes of decreasing predictive significance, eventually reaching leaf nodes that hold the predicted class label. This model is used here because of its built-in feature selection, which helps identify which variables are most influential in detecting fraud, and because it handles large datasets efficiently, which matters since financial transactions typically arrive in high volumes.
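As a minimal illustration (a sketch on R's built-in iris data rather than the transaction dataset, assuming the rpart package is installed), a tree can be fit and its splits inspected directly:
# a minimal sketch on R's built-in iris data, not the transaction dataset
library(rpart)
toy_tree <- rpart(Species ~ ., data = iris, method = "class")
print(toy_tree)  # the printed splits show which features the algorithm found most predictive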
Support Vector Machine: A support vector machine (SVM) is a supervised machine learning model that uses distance to find the optimal hyperplane, or boundary, for classifying the data. Unlike many models that use all of the data points for decision making, an SVM relies on the points that are most difficult to classify; these points, called support vectors, determine the position of the hyperplane.
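As a minimal illustration (again a sketch on iris, assuming the e1071 package used later in this analysis), the fitted object exposes the support vectors that anchor the boundary:
# a minimal sketch on iris, assuming the e1071 package is installed
library(e1071)
toy_svm <- svm(Species ~ ., data = iris, kernel = "radial")
nrow(toy_svm$SV)  # number of support vectors the model retained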
The objective is to create a model that can accurately identify transactions deemed high risk. Using the dataset, I will build a decision tree and a support vector machine model, then compare their results to determine the better predictive model for this analysis.
Our target variable will be 'anomaly'. This column contains the labels "high_risk", "moderate_risk", and "low_risk".
| Variable Name | Definition |
|---|---|
| timestamp | Date and time of transaction |
| hour_of_day | Hour of transaction |
| sending_address | Sender blockchain address |
| receiving_address | Recipient’s blockchain address |
| amount | Amount of transaction |
| transaction_type | Category of Transaction |
| location_region | Geographical region |
| ip_prefix | IP address start |
| login_frequency | User’s login count |
| session_duration | Time spent in session |
| purchase_pattern | Buying behavior type |
| age_group | User account tenure category (new, established, veteran) |
| risk_score | Transaction risk rating |
| anomaly | Risk level classification |
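The dataset is read into a data frame named metav.data. Below is a sketch of how it could be loaded, along with the packages used throughout the analysis; the local file name is an assumption based on the Kaggle download.
# packages used in this analysis (assumed to be installed)
library(dplyr)
library(ggplot2)
library(fastDummies)
library(caret)
library(e1071)
library(rpart)
# a sketch of loading the data; the file name/path is an assumption
metav.data <- read.csv("metaverse_transactions_dataset.csv", stringsAsFactors = FALSE)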
glimpse(metav.data)
## Rows: 78,600
## Columns: 14
## $ timestamp <chr> "4/11/2022 12:47", "6/14/2022 19:12", "1/18/2022 16:…
## $ hour_of_day <dbl> 12, 19, 16, 9, 14, 19, 18, 19, 15, 13, 20, 15, 8, 20…
## $ sending_address <chr> "0x9d32d0bf2c00f41ce7ca01b66e174cc4dcb0c1da", "0xd6e…
## $ receiving_address <chr> "0x39f82e1c09bc6d7baccc1e79e5621ff812f50572", "0x51e…
## $ amount <dbl> 796.9492, 0.0100, 778.1974, 300.8384, 775.5693, 590.…
## $ transaction_type <chr> "transfer", "purchase", "purchase", "transfer", "sal…
## $ location_region <chr> "Europe", "South America", "Asia", "South America", …
## $ ip_prefix <dbl> 192.000, 172.000, 192.168, 172.000, 172.160, 192.168…
## $ login_frequency <dbl> 3, 5, 3, 8, 6, 4, 8, 1, 4, 3, 8, 5, 4, 2, 6, 8, 7, 8…
## $ session_duration <dbl> 48, 61, 74, 111, 100, 66, 103, 32, 42, 79, 85, 54, 6…
## $ purchase_pattern <chr> "focused", "focused", "focused", "high_value", "high…
## $ age_group <chr> "established", "established", "established", "vetera…
## $ risk_score <dbl> 18.7500, 25.0000, 31.2500, 36.7500, 62.5000, 15.7500…
## $ anomaly <chr> "low_risk", "low_risk", "low_risk", "low_risk", "mod…
summary(metav.data)
## timestamp hour_of_day sending_address receiving_address
## Length:78600 Min. : 0.00 Length:78600 Length:78600
## Class :character 1st Qu.: 6.00 Class :character Class :character
## Mode :character Median :12.00 Mode :character Mode :character
## Mean :11.53
## 3rd Qu.:18.00
## Max. :23.00
## amount transaction_type location_region ip_prefix
## Min. : 0.01 Length:78600 Length:78600 Min. : 10.0
## 1st Qu.: 331.32 Class :character Class :character 1st Qu.:172.0
## Median : 500.03 Mode :character Mode :character Median :172.2
## Mean : 502.57 Mean :147.6
## 3rd Qu.: 669.53 3rd Qu.:192.0
## Max. :1557.15 Max. :192.2
## login_frequency session_duration purchase_pattern age_group
## Min. :1.000 Min. : 20.00 Length:78600 Length:78600
## 1st Qu.:2.000 1st Qu.: 35.00 Class :character Class :character
## Median :4.000 Median : 60.00 Mode :character Mode :character
## Mean :4.179 Mean : 69.68
## 3rd Qu.:6.000 3rd Qu.:100.00
## Max. :8.000 Max. :159.00
## risk_score anomaly
## Min. : 15.00 Length:78600
## 1st Qu.: 26.25 Class :character
## Median : 40.00 Mode :character
## Mean : 44.96
## 3rd Qu.: 52.50
## Max. :100.00
One of the advantages of decision trees is that they require minimal data preparation because of the algorithm's splitting criteria: they can handle missing values and are not influenced by unscaled data. Support vector machines, unlike decision trees, require specific data preparation.
When creating a support vector machine, certain steps need to be completed during data preparation. I will be performing the following:
Removing nulls: Null values can disrupt the model building and training process, leading to errors and poor performance, so they need to be removed.
Changing data types: Columns need to be converted to the appropriate types (for example, numeric features as numeric and the target as a factor) so the modeling functions handle them correctly.
Normalizing data: If the numerical variables are not on comparable scales, the SVM's predictions will suffer. Because the model relies on distance calculations, variables measured on larger scales can dominate the result, introducing bias. To reduce this, we need to normalize the numeric data (see the sketch after this list).
Feature engineering: To capture more useful information from certain variables, I will create new fields from the timestamp column. I will also bin categorical variables to reduce the number of dimensions produced by one-hot encoding.
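As a quick sketch of what normalization does here (the same centering and scaling later applied with caret's preProcess; z_scale is a hypothetical helper), each numeric value is re-expressed in standard deviations from its column mean:
# a sketch of z-score standardization; preProcess(method = c("center", "scale")) does the same thing
z_scale <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
head(z_scale(metav.data$amount))  # amounts expressed in standard deviations from the mean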
Below, I convert the timestamp variable to a proper date-time format and then create the month and day_of_week columns.
# fix data column format
metav.data$timestamp <- as.POSIXct(metav.data$timestamp, format = "%m/%d/%Y %H:%M")
# create month and day of week columns
metav.data <- metav.data %>%
mutate(
month = as.numeric(format(timestamp, "%m")),
day_of_week = as.numeric(format(timestamp, "%u"))
)
I decided to bin the values of month, hour_of_day, and day_of_week into quarter, time_of_day, and day_type. This helps reduce the dimensionality of the dataset and general clutter when creating the models.
# bin categorical data
metav.data <- metav.data %>%
mutate(
quarter = cut(month,
breaks = c(0, 3, 6, 9, 12),
labels = c("Q1", "Q2", "Q3", "Q4"),
include.lowest = TRUE),
time_of_day = cut(hour_of_day,
breaks = c(0, 5, 11, 17, 23),
labels = c("Night", "Morning", "Afternoon", "Evening"),
include.lowest = TRUE),
day_type = case_when(
day_of_week %in% 1:5 ~ "Week",
day_of_week %in% 6:7 ~ "Weekend"
)
)
After extracting the necessary information from columns like timestamp and hour_of_day, we can remove those variables from the dataset. I also remove other columns that aren't of interest, such as sending_address and receiving_address, and perform additional type conversions to ensure each column has the appropriate type.
# filter out columns
metav.data <- metav.data %>%
select(-timestamp,-hour_of_day,-sending_address,-receiving_address,-ip_prefix,-risk_score,-month,-day_of_week) %>%
mutate(
amount = as.numeric(amount),
login_frequency = as.numeric(login_frequency),
session_duration = as.numeric(session_duration),
anomaly = as.factor(anomaly)
)
Next we’re going to transform our categorical variables into a numeric binary format using one-hot encoding.
# one-hot encoding
metav.data <- dummy_cols(metav.data,
                         select_columns = c('transaction_type', 'location_region', 'purchase_pattern',
                                            'age_group', 'time_of_day', 'quarter', 'day_type'))
Looking at the distributions of the numeric columns (amount, login_frequency, and session_duration), we can see they are not normalized and sit on different scales. To prevent any one variable from being overly influential, I will normalize all three.
# Prepare data for faceted plot
data_long <- reshape2::melt(metav.data[, c("amount", "login_frequency", "session_duration")])
## No id variables; using all as measure variables
# Create faceted density plot
ggplot(data_long, aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.5) +
facet_wrap(~ variable, scales = "free") +
ggtitle("Density Plots of Numeric Variables (Before Normalization)") +
xlab("Value") +
ylab("Density")
Below, we create the normalized data and store it in the respective columns.
# Normalize the specified columns
preProcValues <- preProcess(metav.data[, c('amount', 'login_frequency', 'session_duration')], method = c("center", "scale"))
norm_data <- predict(preProcValues, metav.data[, c('amount', 'login_frequency', 'session_duration')])
# Replace the original columns in metav.data with the normalized columns
metav.data$amount <- norm_data$amount
metav.data$login_frequency <- norm_data$login_frequency
metav.data$session_duration <- norm_data$session_duration
Checking for nulls, we spot a few in the quarter and day_type columns (and their dummy columns). Since only eight rows are affected, I remove them.
# check nulls
colSums(is.na(metav.data))
## amount transaction_type
## 0 0
## location_region login_frequency
## 0 0
## session_duration purchase_pattern
## 0 0
## age_group anomaly
## 0 0
## quarter time_of_day
## 8 0
## day_type transaction_type_phishing
## 8 0
## transaction_type_purchase transaction_type_sale
## 0 0
## transaction_type_scam transaction_type_transfer
## 0 0
## location_region_Africa location_region_Asia
## 0 0
## location_region_Europe location_region_North America
## 0 0
## location_region_South America purchase_pattern_focused
## 0 0
## purchase_pattern_high_value purchase_pattern_random
## 0 0
## age_group_established age_group_new
## 0 0
## age_group_veteran time_of_day_Night
## 0 0
## time_of_day_Morning time_of_day_Afternoon
## 0 0
## time_of_day_Evening quarter_Q1
## 0 8
## quarter_Q2 quarter_Q3
## 8 8
## quarter_Q4 quarter_NA
## 8 0
## day_type_Week day_type_Weekend
## 8 8
## day_type_NA
## 0
metav.data <- na.omit(metav.data)
Finally, I remove any remaining columns that are not of interest.
# filter out columns
metav.data <- metav.data %>%
select(-transaction_type, -location_region, -purchase_pattern, -age_group, -quarter, -time_of_day, -day_type)
To begin the process of building our model, we start by creating our training and testing dataset.
set.seed(123)
sample_set <- sample(nrow(metav.data), round(nrow(metav.data)*.75), replace = FALSE)
mt_train <- metav.data[sample_set, ]
mt_test <- metav.data[-sample_set, ]
Looking at the proportions of our target variable anomaly, we can see that we are dealing with an imbalanced dataset: the low_risk class accounts for roughly 80% of the full, training, and testing datasets. We will have to rebalance the training data so the model can perform better on data it hasn't seen before.
round(prop.table(table(select(metav.data, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.26 80.78 10.96
round(prop.table(table(select(mt_train, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.28 80.76 10.96
round(prop.table(table(select(mt_test, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.22 80.84 10.93
To rebalance the training set, we will use the downSample function from caret. Downsampling addresses class imbalance by randomly removing samples from the majority class until it matches the size of the smallest class. Afterwards, each class in the column is evenly distributed.
set.seed(234)
mt_train <- downSample(x = mt_train[, -which(names(mt_train) == "anomaly")], y = mt_train$anomaly)
round(prop.table(table(select(mt_train, Class))),4) * 100
## Class
## high_risk low_risk moderate_risk
## 33.33 33.33 33.33
# Train the SVM model
svm_model <- svm(Class ~ ., data = mt_train, type = "C-classification", kernel = "radial")
# Predict on test data
predictions <- predict(svm_model, mt_test)
# Evaluate the model
confusionMatrix(predictions, mt_test$anomaly)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high_risk low_risk moderate_risk
## high_risk 1616 6 0
## low_risk 0 14679 9
## moderate_risk 0 1199 2139
##
## Overall Statistics
##
## Accuracy : 0.9382
## 95% CI : (0.9348, 0.9415)
## No Information Rate : 0.8084
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8331
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: high_risk Class: low_risk Class: moderate_risk
## Sensitivity 1.00000 0.9241 0.9958
## Specificity 0.99967 0.9976 0.9315
## Pos Pred Value 0.99630 0.9994 0.6408
## Neg Pred Value 1.00000 0.7571 0.9994
## Prevalence 0.08225 0.8084 0.1093
## Detection Rate 0.08225 0.7471 0.1089
## Detection Prevalence 0.08255 0.7476 0.1699
## Balanced Accuracy 0.99983 0.9609 0.9636
Overall, our model scored roughly 94% accuracy, a very good sign. The low p-value tells us that the model significantly outperforms the No Information Rate (NIR), which is the accuracy we would get by always predicting the most frequent class in the dataset.
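As a quick sanity check (a sketch using the objects created above), the NIR is simply the share of the most frequent class in the test set:
# the No Information Rate equals the largest class proportion in the test data
max(prop.table(table(mt_test$anomaly)))  # ~0.8084, matching the NIR reported above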
Looking at each class of the target variable, predictions for high_risk and low_risk were highly accurate, with positive predictive values near 99–100%. For transactions deemed moderate_risk, however, the model did not perform as well: only about 64% of its positive predictions were correct.
Another detail to note is that the model struggled when labeling transactions as anything other than low_risk: the negative predictive value for the low_risk class was only about 76%, driven largely by the 1,199 low_risk transactions misclassified as moderate_risk. Next, I prepare the data for the decision tree. Since decision trees need far less preparation, I start from a fresh copy of the original dataset (metav.data2), fix the timestamp format, derive month and day_of_week, drop the columns that are not of interest, and convert the categorical variables to factors.
# fix date values
metav.data2$timestamp <- as.POSIXct(metav.data2$timestamp, format = "%m/%d/%Y %H:%M")
metav.data2 <- metav.data2 %>%
mutate(
month = as.factor(format(timestamp, "%m")),
day_of_week = as.factor(format(timestamp, "%u"))
)
# filter out columns
metav.data2 <- metav.data2 %>%
select(-sending_address,-receiving_address,-ip_prefix,-timestamp, -risk_score) %>%
mutate(
hour_of_day = as.factor(hour_of_day),
transaction_type = as.factor(transaction_type),
location_region = as.factor(location_region),
purchase_pattern = as.factor(purchase_pattern),
age_group = as.factor(age_group),
anomaly = as.factor(anomaly)
)
metav.data2 <- na.omit(metav.data2)
The data is now clean and ready to be split into training and testing datasets. Looking at the proportions of the target variable anomaly, we again see that the training dataset needs to be rebalanced: the low_risk label makes up about 80% of the column. If left unaddressed, this imbalance makes it difficult for the model to predict well on data with different class proportions.
set.seed(411)
sample_set2 <- sample(nrow(metav.data2), round(nrow(metav.data2)*.75), replace = FALSE)
mt_train2 <- metav.data2[sample_set2, ]
mt_test2 <- metav.data2[-sample_set2, ]
round(prop.table(table(select(metav.data2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.26 80.78 10.96
round(prop.table(table(select(mt_train2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.30 80.73 10.97
round(prop.table(table(select(mt_test2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.15 80.94 10.90
As before, we rebalance the training set with the downSample function so that each class is evenly represented.
set.seed(511)
mt_train2 <- downSample(x = mt_train2[, -which(names(mt_train2) == "anomaly")], y = mt_train2$anomaly)
round(prop.table(table(select(mt_train2, Class))),4) * 100
## Class
## high_risk low_risk moderate_risk
## 33.33 33.33 33.33
When visualizing the decision tree, the variables the model deems the most significant predictors are 'risk_score' and 'transaction_type'.
mt_mod <- rpart(Class ~ ., method = "class", data = mt_train2)
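The tree can be rendered with the rpart.plot package (a sketch of the plotting call; the figure itself is not reproduced here):
# visualize the fitted tree; requires the rpart.plot package
library(rpart.plot)
rpart.plot(mt_mod)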
Testing the decision tree on the held-out data, it scored roughly 97% accuracy. While this sounds good, a score this high warrants some caution: the tree separates the classes in this dataset very cleanly, but it may be fitting patterns specific to this data and could struggle to generalize to data it hasn't seen before.
mt_pred <- predict(mt_mod, mt_test2, type = "class")
mt_pred_table <- table(mt_test2$anomaly, mt_pred)
sum(diag(mt_pred_table)/nrow(mt_test2))
## [1] 0.9665106
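For a per-class breakdown comparable to the SVM results, caret's confusionMatrix could also be applied to the tree's predictions (a sketch; output omitted here):
# per-class metrics for the decision tree, mirroring the SVM evaluation (output not shown)
confusionMatrix(mt_pred, mt_test2$anomaly)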
The decision tree model outperformed the SVM model in terms of accuracy. Our SVM scored ~94%, while the decision tree scored ~97%.
Despite the non-normal distributions in the data, the SVM handled the high-dimensional dataset well and achieved a high accuracy score. The decision tree was simpler to set up, required less data preparation, and scored higher. For reasons of interpretability and simplicity, I believe the decision tree is the superior classification model in this case, which also aligns with the recommendations of the studies cited earlier.
In a further study, I would perform cross-validation to see which model performs best against unseen data.
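A sketch of how that cross-validation might look with caret is shown below; this is illustrative only and was not run as part of this analysis ("rpart" and "svmRadial" are caret method names, and svmRadial additionally requires the kernlab package; ctrl, cv_tree, and cv_svm are hypothetical names).
# illustrative 10-fold cross-validation setup; not run as part of this analysis
ctrl <- trainControl(method = "cv", number = 10)
cv_tree <- train(Class ~ ., data = mt_train2, method = "rpart", trControl = ctrl)
cv_svm <- train(Class ~ ., data = mt_train, method = "svmRadial", trControl = ctrl)
summary(resamples(list(tree = cv_tree, svm = cv_svm)))  # compare resampled accuracy across folds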
Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision tree ensembles to predict Coronavirus Disease 2019 infection: A comparative study. Complexity, 2021, 1–8.
Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). https://doi.org/10.1016/B978-0-12-824536-1.00014-9
Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. In Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 (Vol. I). IMECS 2011, March 16–18, Hong Kong.
Jain, Y., Tiwari, N., Dubey, S., & Jain, S. (2019). A comparative analysis of various credit card fraud detection techniques. International Journal of Recent Technology and Engineering (IJRTE), 7(5S2).
Sahin, Y., Bulkan, S., & Duman, E. (2013). A cost-sensitive decision tree approach for fraud detection. Expert Systems with Applications, 40, 5916–5923.
Chen, Q. (2010). Predictive modeling for non-profit fundraising. James Madison University.