Abstract

Every new innovation or feature that the Open Metaverse brings also creates new venues and opportunities for scammers to exploit. Given the risks associated with this new world, companies are hesitant to bring business into the Metaverse for fear of potential losses, and this hesitancy may delay the adoption and growth of a technology that shows real promise.

Machine learning algorithms are extremely valuable in the detection of fraudulent activity. Some common tools used include decision tree models and support vector machines (SVM).

Keywords

Supervised Machine Learning, Classification Models, Dimensionality, Support Vectors, Fraud Detection, Support Vector Machine, Decision Tree

Introduction

The dataset used in this analysis, taken from ‘https://www.kaggle.com/datasets/faizaniftikharjanjua/metaverse-financial-transactions-dataset?resource=download’, contains over 78,000 rows of blockchain financial transactions. It includes the time each transaction took place, the transaction value, the sender and recipient blockchain addresses, and user behavior attributes.
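
Before the analysis, the packages used throughout this report need to be attached and the data loaded. Below is a minimal sketch, assuming the Kaggle CSV has been downloaded locally (the filename shown is hypothetical):

# Packages used throughout this report
library(dplyr)        # data manipulation, glimpse()
library(ggplot2)      # plotting
library(fastDummies)  # one-hot encoding via dummy_cols()
library(caret)        # preProcess(), downSample(), confusionMatrix()
library(e1071)        # svm()
library(rpart)        # decision trees

# Load the downloaded Kaggle CSV (hypothetical filename; adjust to your local file)
metav.data  <- read.csv("metaverse_transactions_dataset.csv", stringsAsFactors = FALSE)
metav.data2 <- metav.data   # untouched copy, reserved for the decision tree workflow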

In this analysis I will build a decision tree model and a support vector machine model and compare their performance in classifying transactions as high, moderate, or low risk.

Algorithms

Decision Trees: Decision trees are a supervised learning algorithm that recursively splits a dataset into a tree-like model to reach a classification. The tree starts at a root node, which splits on the most predictive feature, and branches into nodes of decreasing predictive significance until it reaches a leaf holding the predicted class label. This model is used here because of its ability to perform feature selection, which helps us see which variables are most influential in detecting fraud. It also handles large datasets efficiently, which suits financial transactions that typically occur at high volume.
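
As a quick toy illustration (separate from the fraud dataset), the sketch below fits a small classification tree on R’s built-in iris data; the printed tree shows the root splitting on the most predictive feature and the leaves holding the predicted class labels:

# Toy decision tree on built-in data: the root node splits on the most
# predictive feature and each branch ends in a leaf with a predicted class.
library(rpart)
toy_tree <- rpart(Species ~ ., data = iris, method = "class")
print(toy_tree)   # text view of the splits and leaf classes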

Support Vector Machine: A support vector machine (SVM) is a supervised machine learning model that uses distance to find the optimal hyperplane, or boundary, for classifying data. Unlike models that weigh all of the data equally when making decisions, an SVM relies on the data points that are most difficult to classify. These points are called support vectors, and they determine the position of the hyperplane.
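
Again as a toy illustration on the iris data, the sketch below trains a radial-kernel SVM with e1071; the fitted object reports how many support vectors (the hard-to-classify boundary points) were kept to define the hyperplane:

# Toy SVM on built-in data: only the support vectors determine the boundary.
library(e1071)
toy_svm <- svm(Species ~ ., data = iris,
               type = "C-classification", kernel = "radial", scale = TRUE)
summary(toy_svm)   # reports the number of support vectors per class
nrow(toy_svm$SV)   # total number of support vectors retained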

Objective

The objective will be to create a model that can accurately predict transactions that are deemed high risk. Using the dataset, I will create a decision tree and a support vector machine model. Then I will compare results and determine the better predictive model in this analysis.

SVM Model

Data Exploration

The dataset, stored as ‘metav.data’, contains 78,600 rows and the 14 variables listed below.

Our target variable will be ‘anomaly’. This column contains the labels “high_risk”, “moderate_risk” and “low_risk”.

Variable Name       Definition
timestamp           Date and time of transaction
hour_of_day         Hour of transaction
sending_address     Sender blockchain address
receiving_address   Recipient's blockchain address
amount              Amount of transaction
transaction_type    Category of transaction
location_region     Geographical region
ip_prefix           IP address start
login_frequency     User's login count
session_duration    Time spent in session
purchase_pattern    Buying behavior type
age_group           User's activity level
risk_score          Transaction risk rating
anomaly             Risk level classification
glimpse(metav.data)
## Rows: 78,600
## Columns: 14
## $ timestamp         <chr> "4/11/2022 12:47", "6/14/2022 19:12", "1/18/2022 16:…
## $ hour_of_day       <dbl> 12, 19, 16, 9, 14, 19, 18, 19, 15, 13, 20, 15, 8, 20…
## $ sending_address   <chr> "0x9d32d0bf2c00f41ce7ca01b66e174cc4dcb0c1da", "0xd6e…
## $ receiving_address <chr> "0x39f82e1c09bc6d7baccc1e79e5621ff812f50572", "0x51e…
## $ amount            <dbl> 796.9492, 0.0100, 778.1974, 300.8384, 775.5693, 590.…
## $ transaction_type  <chr> "transfer", "purchase", "purchase", "transfer", "sal…
## $ location_region   <chr> "Europe", "South America", "Asia", "South America", …
## $ ip_prefix         <dbl> 192.000, 172.000, 192.168, 172.000, 172.160, 192.168…
## $ login_frequency   <dbl> 3, 5, 3, 8, 6, 4, 8, 1, 4, 3, 8, 5, 4, 2, 6, 8, 7, 8…
## $ session_duration  <dbl> 48, 61, 74, 111, 100, 66, 103, 32, 42, 79, 85, 54, 6…
## $ purchase_pattern  <chr> "focused", "focused", "focused", "high_value", "high…
## $ age_group         <chr> "established", "established", "established", "vetera…
## $ risk_score        <dbl> 18.7500, 25.0000, 31.2500, 36.7500, 62.5000, 15.7500…
## $ anomaly           <chr> "low_risk", "low_risk", "low_risk", "low_risk", "mod…
summary(metav.data)
##   timestamp          hour_of_day    sending_address    receiving_address 
##  Length:78600       Min.   : 0.00   Length:78600       Length:78600      
##  Class :character   1st Qu.: 6.00   Class :character   Class :character  
##  Mode  :character   Median :12.00   Mode  :character   Mode  :character  
##                     Mean   :11.53                                        
##                     3rd Qu.:18.00                                        
##                     Max.   :23.00                                        
##      amount        transaction_type   location_region      ip_prefix    
##  Min.   :   0.01   Length:78600       Length:78600       Min.   : 10.0  
##  1st Qu.: 331.32   Class :character   Class :character   1st Qu.:172.0  
##  Median : 500.03   Mode  :character   Mode  :character   Median :172.2  
##  Mean   : 502.57                                         Mean   :147.6  
##  3rd Qu.: 669.53                                         3rd Qu.:192.0  
##  Max.   :1557.15                                         Max.   :192.2  
##  login_frequency session_duration purchase_pattern    age_group        
##  Min.   :1.000   Min.   : 20.00   Length:78600       Length:78600      
##  1st Qu.:2.000   1st Qu.: 35.00   Class :character   Class :character  
##  Median :4.000   Median : 60.00   Mode  :character   Mode  :character  
##  Mean   :4.179   Mean   : 69.68                                        
##  3rd Qu.:6.000   3rd Qu.:100.00                                        
##  Max.   :8.000   Max.   :159.00                                        
##    risk_score       anomaly         
##  Min.   : 15.00   Length:78600      
##  1st Qu.: 26.25   Class :character  
##  Median : 40.00   Mode  :character  
##  Mean   : 44.96                     
##  3rd Qu.: 52.50                     
##  Max.   :100.00

Data Preparation

One of the advantages of decision trees is that they require minimal data preparation because of the algorithm’s splitting criteria: they can ignore nulls and are not influenced by non-normalized data. Support vector machines, unlike decision trees, require specific data preparation.

When creating a support vector machine, certain steps need to be completed during data preparation. I will be performing the following:

  1. Remove nulls: Nulls can disrupt the model building and training process, leading to errors and poor performance, so they need to be removed.

  2. Change data types:

  • Handle categorical variables: SVMs cannot work with categorical data directly, so categorical values have to be converted into a numerical format. I will be using one-hot encoding.
  • Numerical data: Ensure that numerical variables are stored as a numeric data type.

  3. Normalize data: If the numerical variables are not appropriately scaled, the SVM model will not make accurate predictions. Because the model relies on distance calculations, variables on a larger scale can dominate the distances and bias the model. To reduce this, we need to normalize the data (a toy sketch follows this list).

  4. Feature engineering: To capture more valuable features from certain variables, I will create new fields from the timestamp column. I will also bin categorical variables to reduce the number of dimensions created by one-hot encoding.
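
As a small illustration of steps 2 and 3 on a hypothetical toy data frame (not the Metaverse data), one-hot encoding turns each category into its own 0/1 indicator column, and centering/scaling puts numeric columns on a comparable scale:

library(fastDummies)

# Hypothetical toy data with one categorical and one numeric column
toy <- data.frame(transaction_type = c("purchase", "transfer", "scam"),
                  amount           = c(10, 500, 1200))

# One-hot encode: one 0/1 indicator column per category level
toy <- dummy_cols(toy, select_columns = "transaction_type",
                  remove_selected_columns = TRUE)

# Center and scale the numeric column so it cannot dominate distance calculations
toy$amount <- as.numeric(scale(toy$amount))
toy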

Below I ensure that the timestamp variable is in a timestamp format, and then create the month and day_of_week columns.

# fix data column format
metav.data$timestamp <- as.POSIXct(metav.data$timestamp, format = "%m/%d/%Y %H:%M")

# create month and day of week columns
metav.data <- metav.data %>%
  mutate(
  month = as.numeric(format(timestamp, "%m")), 
  day_of_week = as.numeric(format(timestamp, "%u"))    
  )

I then bin the month, hour_of_day and day_of_week values into quarter, time_of_day and day_type categories. This helps reduce the dimensionality of the dataset and general clutter when creating our models.

# bin categorical data
metav.data <- metav.data %>%
  mutate(
    quarter = cut(month,
                  breaks = c(0, 3, 6, 9, 12),
                  labels = c("Q1", "Q2", "Q3", "Q4"),
                  include.lowest = TRUE),
    time_of_day = cut(hour_of_day,
                      breaks = c(0, 5, 11, 17, 23),
                      labels = c("Night", "Morning", "Afternoon", "Evening"),
                      include.lowest = TRUE),
    day_type = case_when(
      day_of_week %in% 1:5 ~ "Week",
      day_of_week %in% 6:7 ~ "Weekend"
    )
  )

After extracting the necessary information from columns like timestamp and hour_of_day, we can remove those variables from the dataset. I also remove other columns that are not of interest, such as sending_address and receiving_address, and perform additional transformations to ensure each column has the appropriate data type.

# filter out columns 
metav.data <- metav.data %>%
  select(-timestamp,-hour_of_day,-sending_address,-receiving_address,-ip_prefix,-risk_score,-month,-day_of_week) %>%
  mutate(
  amount = as.numeric(amount),
  login_frequency = as.numeric(login_frequency),
  session_duration = as.numeric(session_duration),
  anomaly = as.factor(anomaly)
  )

Next we’re going to transform our categorical variables into a numeric binary format using one-hot encoding.

# one-hot encoding: create a 0/1 indicator column for every category level
metav.data <- dummy_cols(
  metav.data,
  select_columns = c("transaction_type", "location_region", "purchase_pattern",
                     "age_group", "time_of_day", "quarter", "day_type")
)

Normalize data

Looking at the distribution of the numeric columns (amount, login_frequency and session_duration), we can see that they are not normalized and sit on different scales. To prevent any one variable from being overly influential, I will normalize all three.

# Prepare data for faceted plot
data_long <- reshape2::melt(metav.data[, c("amount", "login_frequency", "session_duration")])
## No id variables; using all as measure variables
# Create faceted density plot
ggplot(data_long, aes(x = value)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Density Plots for Normalized Variables") +
  xlab("Value") +
  ylab("Density")

Below we create the normalized data and store the values back in their respective columns.

# Normalize the specified columns
preProcValues <- preProcess(metav.data[, c('amount', 'login_frequency', 'session_duration')], method = c("center", "scale"))
norm_data <- predict(preProcValues, metav.data[, c('amount', 'login_frequency', 'session_duration')])

# Replace the original columns in metav.data with the normalized columns
metav.data$amount <- norm_data$amount
metav.data$login_frequency <- norm_data$login_frequency
metav.data$session_duration <- norm_data$session_duration

Inspect nulls

We spot a few nulls in the quarter and day_type columns (8 rows). Since there are so few, I decided to remove them.

# check nulls
colSums(is.na(metav.data))
##                        amount              transaction_type 
##                             0                             0 
##               location_region               login_frequency 
##                             0                             0 
##              session_duration              purchase_pattern 
##                             0                             0 
##                     age_group                       anomaly 
##                             0                             0 
##                       quarter                   time_of_day 
##                             8                             0 
##                      day_type     transaction_type_phishing 
##                             8                             0 
##     transaction_type_purchase         transaction_type_sale 
##                             0                             0 
##         transaction_type_scam     transaction_type_transfer 
##                             0                             0 
##        location_region_Africa          location_region_Asia 
##                             0                             0 
##        location_region_Europe location_region_North America 
##                             0                             0 
## location_region_South America      purchase_pattern_focused 
##                             0                             0 
##   purchase_pattern_high_value       purchase_pattern_random 
##                             0                             0 
##         age_group_established                 age_group_new 
##                             0                             0 
##             age_group_veteran             time_of_day_Night 
##                             0                             0 
##           time_of_day_Morning         time_of_day_Afternoon 
##                             0                             0 
##           time_of_day_Evening                    quarter_Q1 
##                             0                             8 
##                    quarter_Q2                    quarter_Q3 
##                             8                             8 
##                    quarter_Q4                    quarter_NA 
##                             8                             0 
##                 day_type_Week              day_type_Weekend 
##                             8                             8 
##                   day_type_NA 
##                             0
metav.data <- na.omit(metav.data)

Finalize dataset

Finally, I remove the original categorical columns, which are no longer needed now that they have been one-hot encoded.

# filter out columns 
metav.data <- metav.data %>%
  select(-transaction_type, -location_region, -purchase_pattern, -age_group, -quarter, -time_of_day, -day_type)

Model Building

To begin building our model, we start by creating the training and testing datasets.

set.seed(123)
sample_set <- sample(nrow(metav.data), round(nrow(metav.data)*.75), replace = FALSE)
mt_train <- metav.data[sample_set, ]
mt_test <- metav.data[-sample_set, ]

Looking at the proportions of our target variable anomaly, we can see that we are dealing with an imbalanced dataset: the low_risk class accounts for roughly 80% of the original, training and testing datasets. We will have to rebalance the training set so that the model performs better on data it hasn’t seen before.

round(prop.table(table(select(metav.data, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.26         80.78         10.96
round(prop.table(table(select(mt_train, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.28         80.76         10.96
round(prop.table(table(select(mt_test, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.22         80.84         10.93

Rebalancing

To rebalance the training set, we will use caret’s downSample function. Downsampling addresses class imbalance by randomly removing samples from the majority class until it matches the size of the minority class(es). After downsampling, each class is evenly represented.

set.seed(234)
mt_train <- downSample(x = mt_train[, -which(names(mt_train) == "anomaly")], y = mt_train$anomaly)

round(prop.table(table(select(mt_train, Class))),4) * 100
## Class
##     high_risk      low_risk moderate_risk 
##         33.33         33.33         33.33

Predictions

# Train the SVM model with e1071 (type = "C-classification" specifies a classification SVM)
svm_model <- svm(Class ~ ., data = mt_train, type = "C-classification", kernel = "radial")

# Predict on test data
predictions <- predict(svm_model, mt_test)

# Evaluate the model
confusionMatrix(predictions, mt_test$anomaly)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      high_risk low_risk moderate_risk
##   high_risk          1616        6             0
##   low_risk              0    14679             9
##   moderate_risk         0     1199          2139
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9382          
##                  95% CI : (0.9348, 0.9415)
##     No Information Rate : 0.8084          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8331          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: high_risk Class: low_risk Class: moderate_risk
## Sensitivity                   1.00000          0.9241               0.9958
## Specificity                   0.99967          0.9976               0.9315
## Pos Pred Value                0.99630          0.9994               0.6408
## Neg Pred Value                1.00000          0.7571               0.9994
## Prevalence                    0.08225          0.8084               0.1093
## Detection Rate                0.08225          0.7471               0.1089
## Detection Prevalence          0.08255          0.7476               0.1699
## Balanced Accuracy             0.99983          0.9609               0.9636

Overall, the model scored roughly 94% accuracy, a very good sign. The low p-value tells us that the model significantly outperforms the No Information Rate (NIR), which is the accuracy one would get by always predicting the most frequent class in the dataset (here, low_risk at ~80.8%).

Looking at each class of the target variable, predictions for high_risk and low_risk were highly reliable, with positive predictive values of roughly 99.6% and 99.9% respectively. For transactions deemed moderate_risk, however, the model did not perform as well: only about 64% of its moderate_risk predictions were correct.

Another detail to note is the low negative predictive value for the low_risk class (~76%): about a quarter of the transactions the model flagged as something other than low_risk were in fact low_risk, most of them misclassified as moderate_risk.
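
To make these figures concrete, the sketch below recomputes the No Information Rate, the moderate_risk positive predictive value and the low_risk negative predictive value directly from the confusion matrix counts shown above:

# Confusion matrix counts copied from the caret output above
cm <- matrix(c(1616,     6,    0,
                  0, 14679,    9,
                  0,  1199, 2139),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("high_risk", "low_risk", "moderate_risk"),
                             Reference  = c("high_risk", "low_risk", "moderate_risk")))

# No Information Rate: accuracy from always predicting the most common class
max(colSums(cm)) / sum(cm)                                          # 15884 / 19648 ~ 0.8084

# Positive predictive value (precision) for moderate_risk
cm["moderate_risk", "moderate_risk"] / sum(cm["moderate_risk", ])   # 2139 / 3338 ~ 0.6408

# Negative predictive value for low_risk: of all transactions predicted as
# NOT low_risk, the share that truly is not low_risk
sum(cm[c("high_risk", "moderate_risk"), c("high_risk", "moderate_risk")]) /
  sum(cm[c("high_risk", "moderate_risk"), ])                        # 3755 / 4960 ~ 0.7571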


Decision Tree

Data Preparation

# metav.data2 is a separate, untransformed copy of the raw dataset
# fix date column format
metav.data2$timestamp <- as.POSIXct(metav.data2$timestamp, format = "%m/%d/%Y %H:%M")

metav.data2 <- metav.data2 %>%
  mutate(
  month = as.factor(format(timestamp, "%m")), 
  day_of_week = as.factor(format(timestamp, "%u"))    
  )
# filter out columns 
metav.data2 <- metav.data2 %>%
  select(-sending_address,-receiving_address,-ip_prefix,-timestamp, -risk_score) %>%
  mutate(
  hour_of_day = as.factor(hour_of_day),
  transaction_type = as.factor(transaction_type),
  location_region = as.factor(location_region),
  purchase_pattern = as.factor(purchase_pattern),
  age_group = as.factor(age_group),
  anomaly = as.factor(anomaly)
  )
metav.data2 <- na.omit(metav.data2)

Model Building

The data is now clean and ready to be split into training and testing datasets. Looking at the proportions of our target variable anomaly, we again see that the training dataset needs to be rebalanced: the low_risk label comprises roughly 80% of the column. If left unbalanced, the training set would bias the model toward the majority class and hurt its performance on the minority classes.

set.seed(411)
sample_set2 <- sample(nrow(metav.data2), round(nrow(metav.data2)*.75), replace = FALSE)
mt_train2 <- metav.data2[sample_set2, ]
mt_test2 <- metav.data2[-sample_set2, ]
round(prop.table(table(select(metav.data2, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.26         80.78         10.96
round(prop.table(table(select(mt_train2, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.30         80.73         10.97
round(prop.table(table(select(mt_test2, anomaly))),4) * 100
## anomaly
##     high_risk      low_risk moderate_risk 
##          8.15         80.94         10.90

Rebalancing

As with the SVM model, we use the downSample function to rebalance the training set so that each class is evenly represented.

set.seed(511)
mt_train2 <- downSample(x = mt_train2[, -which(names(mt_train2) == "anomaly")], y = mt_train2$anomaly)

round(prop.table(table(select(mt_train2, Class))),4) * 100
## Class
##     high_risk      low_risk moderate_risk 
##         33.33         33.33         33.33

When visualizing the decision tree, we can see which variables the model relies on most heavily for its splits, with ‘transaction_type’ standing out as a significant predictor.

mt_mod <- rpart(Class ~ ., method = "class", data = mt_train2)

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
rpart.plot(mt_mod)

Predictions

Testing the decision tree on the held-out test set, we see that it scored approximately 96.7% accuracy. While that sounds excellent, such a high score should be treated with some caution: it tells us the model classifies the transactions in this dataset very well, but we would want further validation to confirm it performs as well on data it hasn’t seen before.

mt_pred <- predict(mt_mod, mt_test2, type = "class")
mt_pred_table <- table(mt_test2$anomaly, mt_pred)
sum(diag(mt_pred_table)/nrow(mt_test2))
## [1] 0.9665106
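
For a class-level breakdown comparable to the SVM evaluation (sensitivity, specificity and positive predictive value per class), caret’s confusionMatrix can also be run on these predictions; its output is omitted here.

# Per-class statistics for the decision tree, analogous to the SVM evaluation
confusionMatrix(mt_pred, mt_test2$anomaly)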

Final Thoughts

The decision tree model outperformed the SVM model in terms of accuracy. Our SVM scored ~94%, while the decision tree scored ~97%.

Despite the non-normal distributions in the data, the SVM model handled the high-dimensional dataset well and achieved high accuracy. The decision tree was simpler to set up, required less data preparation and scored higher accuracy. For reasons of interpretability and simplicity, I believe the decision tree is the superior classification model in this case, which is consistent with the findings of the studies cited in the references.

In a further study, I would perform cross-validation to see which model performs best against unseen data.
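
As a rough sketch of what that could look like (assuming the caret and kernlab packages are available, and using the prepared datasets from earlier; settings are illustrative, not tuned):

library(caret)

set.seed(2024)
ctrl <- trainControl(method = "cv", number = 5)

# 5-fold cross-validation for the decision tree on the factor-based dataset
cv_tree <- train(anomaly ~ ., data = metav.data2, method = "rpart", trControl = ctrl)

# 5-fold cross-validation for a radial-kernel SVM on the encoded, normalized dataset
# (method = "svmRadial" requires kernlab; this can be slow on ~78k rows)
cv_svm <- train(anomaly ~ ., data = metav.data, method = "svmRadial", trControl = ctrl)

# Compare resampled accuracy and kappa across the two models
summary(resamples(list(tree = cv_tree, svm = cv_svm)))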

References

  1. Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision tree ensembles to predict coronavirus disease 2019 infection: A comparative study. Complexity, 1–8.

  2. Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). doi:10.1016/B978-0-12-824536-1.00014-9.

  3. Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011, Vol. I (IMECS 2011), March 16–18, Hong Kong.

  4. Jain, Y., Tiwari, N., Dubey, S., & Jain, S. (2019). A comparative analysis of various credit card fraud detection techniques. International Journal of Recent Technology and Engineering (IJRTE), 7(5S2).

  5. Sahin, Y., Bulkan, S., & Duman, E. (2013). A cost-sensitive decision tree approach for fraud detection. Expert Systems with Applications, 40, 5916–5923.

  6. Chen, Q. (2010). Predictive modeling for non-profit fundraising. James Madison University.