Machine learning algorithms are extremely valuable for detecting fraudulent activity. Two commonly used tools are decision tree models and support vector machines (SVMs).
The following dataset, taken from https://www.kaggle.com/datasets/faizaniftikharjanjua/metaverse-financial-transactions-dataset?resource=download, contains over 78,000 rows of blockchain financial transactions. It includes the time each transaction took place, its value, the sender's and recipient's blockchain addresses, and user behavior attributes.
In this analysis I will build a decision tree model and a support vector machine model and compare their performance in classifying transactions as high, moderate, or low risk.
Decision Trees: A decision tree is a supervised learning algorithm that recursively splits a dataset into a tree-like structure to determine a classification. It starts at a root node containing the most predictive feature and branches into nodes of decreasing predictive significance, eventually reaching leaf nodes that hold the predicted class label. This model is used here because of its built-in feature selection, which helps identify which variables are most influential in detecting fraud, and because it handles large datasets efficiently, which matters since financial transactions typically arrive in high volumes.
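As a minimal illustration (a sketch on R's built-in iris data rather than the transaction dataset, assuming the rpart package is installed), a tree can be fit and its splits inspected directly:
# a minimal sketch on R's built-in iris data, not the transaction dataset
library(rpart)
toy_tree <- rpart(Species ~ ., data = iris, method = "class")
print(toy_tree)  # the printed splits show which features the algorithm found most predictive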
Support Vector Machine: A support vector machine (SVM) is a supervised machine learning model that uses distance to find the optimal hyperplane, or boundary, for classifying the data. Unlike many models that use all of the data points for decision making, an SVM relies on the points that are most difficult to classify; these points, called support vectors, determine the position of the hyperplane.
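As a minimal illustration (again a sketch on iris, assuming the e1071 package used later in this analysis), the fitted object exposes the support vectors that anchor the boundary:
# a minimal sketch on iris, assuming the e1071 package is installed
library(e1071)
toy_svm <- svm(Species ~ ., data = iris, kernel = "radial")
nrow(toy_svm$SV)  # number of support vectors the model retained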
The objective is to create a model that can accurately identify transactions deemed high risk. Using the dataset, I will build a decision tree and a support vector machine model, then compare their results to determine the better predictive model for this analysis.
Our target variable will be 'anomaly'. This column contains the labels "high_risk", "moderate_risk", and "low_risk".
| Variable Name | Definition |
|---|---|
| timestamp | Date and time of transaction |
| hour_of_day | Hour of transaction |
| sending_address | Sender blockchain address |
| receiving_address | Recipient’s blockchain address |
| amount | Amount of transaction |
| transaction_type | Category of Transaction |
| location_region | Geographical region |
| ip_prefix | IP address start |
| login_frequency | User’s login count |
| session_duration | Time spent in session |
| purchase_pattern | Buying behavior type |
| age_group | User account tenure category (new, established, veteran) |
| risk_score | Transaction risk rating |
| anomaly | Risk level classification |
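The dataset is read into a data frame named metav.data. Below is a sketch of how it could be loaded, along with the packages used throughout the analysis; the local file name is an assumption based on the Kaggle download.
# packages used in this analysis (assumed to be installed)
library(dplyr)
library(ggplot2)
library(fastDummies)
library(caret)
library(e1071)
library(rpart)
# a sketch of loading the data; the file name/path is an assumption
metav.data <- read.csv("metaverse_transactions_dataset.csv", stringsAsFactors = FALSE)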
glimpse(metav.data)
## Rows: 78,600
## Columns: 14
## $ timestamp <chr> "4/11/2022 12:47", "6/14/2022 19:12", "1/18/2022 16:…
## $ hour_of_day <dbl> 12, 19, 16, 9, 14, 19, 18, 19, 15, 13, 20, 15, 8, 20…
## $ sending_address <chr> "0x9d32d0bf2c00f41ce7ca01b66e174cc4dcb0c1da", "0xd6e…
## $ receiving_address <chr> "0x39f82e1c09bc6d7baccc1e79e5621ff812f50572", "0x51e…
## $ amount <dbl> 796.9492, 0.0100, 778.1974, 300.8384, 775.5693, 590.…
## $ transaction_type <chr> "transfer", "purchase", "purchase", "transfer", "sal…
## $ location_region <chr> "Europe", "South America", "Asia", "South America", …
## $ ip_prefix <dbl> 192.000, 172.000, 192.168, 172.000, 172.160, 192.168…
## $ login_frequency <dbl> 3, 5, 3, 8, 6, 4, 8, 1, 4, 3, 8, 5, 4, 2, 6, 8, 7, 8…
## $ session_duration <dbl> 48, 61, 74, 111, 100, 66, 103, 32, 42, 79, 85, 54, 6…
## $ purchase_pattern <chr> "focused", "focused", "focused", "high_value", "high…
## $ age_group <chr> "established", "established", "established", "vetera…
## $ risk_score <dbl> 18.7500, 25.0000, 31.2500, 36.7500, 62.5000, 15.7500…
## $ anomaly <chr> "low_risk", "low_risk", "low_risk", "low_risk", "mod…
summary(metav.data)
## timestamp hour_of_day sending_address receiving_address
## Length:78600 Min. : 0.00 Length:78600 Length:78600
## Class :character 1st Qu.: 6.00 Class :character Class :character
## Mode :character Median :12.00 Mode :character Mode :character
## Mean :11.53
## 3rd Qu.:18.00
## Max. :23.00
## amount transaction_type location_region ip_prefix
## Min. : 0.01 Length:78600 Length:78600 Min. : 10.0
## 1st Qu.: 331.32 Class :character Class :character 1st Qu.:172.0
## Median : 500.03 Mode :character Mode :character Median :172.2
## Mean : 502.57 Mean :147.6
## 3rd Qu.: 669.53 3rd Qu.:192.0
## Max. :1557.15 Max. :192.2
## login_frequency session_duration purchase_pattern age_group
## Min. :1.000 Min. : 20.00 Length:78600 Length:78600
## 1st Qu.:2.000 1st Qu.: 35.00 Class :character Class :character
## Median :4.000 Median : 60.00 Mode :character Mode :character
## Mean :4.179 Mean : 69.68
## 3rd Qu.:6.000 3rd Qu.:100.00
## Max. :8.000 Max. :159.00
## risk_score anomaly
## Min. : 15.00 Length:78600
## 1st Qu.: 26.25 Class :character
## Median : 40.00 Mode :character
## Mean : 44.96
## 3rd Qu.: 52.50
## Max. :100.00
One of the advantages of decision trees is that they require minimal data preparation because of the algorithm's splitting criteria: they can handle missing values and are not influenced by unscaled data. Support vector machines, unlike decision trees, require specific data preparation.
When creating a support vector machine, certain steps need to be completed during data preparation. I will be performing the following:
Removing nulls: Null values can disrupt the model building and training process, leading to errors and poor performance, so they need to be removed.
Changing data types: Columns need to be converted to the appropriate types (for example, numeric features as numeric and the target as a factor) so the modeling functions handle them correctly.
Normalizing data: If the numerical variables are not on comparable scales, the SVM's predictions will suffer. Because the model relies on distance calculations, variables measured on larger scales can dominate the result, introducing bias. To reduce this, we need to normalize the numeric data (see the sketch after this list).
Feature engineering: To capture more useful information from certain variables, I will create new fields from the timestamp column. I will also bin categorical variables to reduce the number of dimensions produced by one-hot encoding.
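As a quick sketch of what normalization does here (the same centering and scaling later applied with caret's preProcess; z_scale is a hypothetical helper), each numeric value is re-expressed in standard deviations from its column mean:
# a sketch of z-score standardization; preProcess(method = c("center", "scale")) does the same thing
z_scale <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
head(z_scale(metav.data$amount))  # amounts expressed in standard deviations from the mean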
Below, I convert the timestamp variable to a proper date-time format and then create the month and day_of_week columns.
# fix data column format
metav.data$timestamp <- as.POSIXct(metav.data$timestamp, format = "%m/%d/%Y %H:%M")
# create month and day of week columns
metav.data <- metav.data %>%
mutate(
month = as.numeric(format(timestamp, "%m")),
day_of_week = as.numeric(format(timestamp, "%u"))
)
I decided to bin the values of month, hour_of_day, and day_of_week into quarter, time_of_day, and day_type. This helps reduce the dimensionality of the dataset and general clutter when creating the models.
# bin categorical data
metav.data <- metav.data %>%
mutate(
quarter = cut(month,
breaks = c(0, 3, 6, 9, 12),
labels = c("Q1", "Q2", "Q3", "Q4"),
include.lowest = TRUE),
time_of_day = cut(hour_of_day,
breaks = c(0, 5, 11, 17, 23),
labels = c("Night", "Morning", "Afternoon", "Evening"),
include.lowest = TRUE),
day_type = case_when(
day_of_week %in% 1:5 ~ "Week",
day_of_week %in% 6:7 ~ "Weekend"
)
)
After extracting the necessary information from columns like timestamp and hour_of_day, we can remove those variables from the dataset. I also remove other columns that aren't of interest, such as sending_address and receiving_address, and perform additional type conversions to ensure each column has the appropriate type.
# filter out columns
metav.data <- metav.data %>%
select(-timestamp,-hour_of_day,-sending_address,-receiving_address,-ip_prefix,-risk_score,-month,-day_of_week) %>%
mutate(
amount = as.numeric(amount),
login_frequency = as.numeric(login_frequency),
session_duration = as.numeric(session_duration),
anomaly = as.factor(anomaly)
)
Next we’re going to transform our categorical variables into a numeric binary format using one-hot encoding.
# one-hot encoding
metav.data <- dummy_cols(metav.data,
                         select_columns = c('transaction_type', 'location_region', 'purchase_pattern',
                                            'age_group', 'time_of_day', 'quarter', 'day_type'))
Looking at the distributions of the numeric columns (amount, login_frequency, and session_duration), we can see they are not normalized and sit on different scales. To prevent any one variable from being overly influential, I will normalize all three.
# Prepare data for faceted plot
data_long <- reshape2::melt(metav.data[, c("amount", "login_frequency", "session_duration")])
## No id variables; using all as measure variables
# Create faceted density plot
ggplot(data_long, aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.5) +
facet_wrap(~ variable, scales = "free") +
ggtitle("Density Plots of Numeric Variables (Before Normalization)") +
xlab("Value") +
ylab("Density")
Below, we create the normalized data and store it in the respective columns.
# Normalize the specified columns
preProcValues <- preProcess(metav.data[, c('amount', 'login_frequency', 'session_duration')], method = c("center", "scale"))
norm_data <- predict(preProcValues, metav.data[, c('amount', 'login_frequency', 'session_duration')])
# Replace the original columns in metav.data with the normalized columns
metav.data$amount <- norm_data$amount
metav.data$login_frequency <- norm_data$login_frequency
metav.data$session_duration <- norm_data$session_duration
Checking for nulls, we spot a few in the quarter and day_type columns (and their dummy columns). Since only eight rows are affected, I remove them.
# check nulls
colSums(is.na(metav.data))
## amount transaction_type
## 0 0
## location_region login_frequency
## 0 0
## session_duration purchase_pattern
## 0 0
## age_group anomaly
## 0 0
## quarter time_of_day
## 8 0
## day_type transaction_type_phishing
## 8 0
## transaction_type_purchase transaction_type_sale
## 0 0
## transaction_type_scam transaction_type_transfer
## 0 0
## location_region_Africa location_region_Asia
## 0 0
## location_region_Europe location_region_North America
## 0 0
## location_region_South America purchase_pattern_focused
## 0 0
## purchase_pattern_high_value purchase_pattern_random
## 0 0
## age_group_established age_group_new
## 0 0
## age_group_veteran time_of_day_Night
## 0 0
## time_of_day_Morning time_of_day_Afternoon
## 0 0
## time_of_day_Evening quarter_Q1
## 0 8
## quarter_Q2 quarter_Q3
## 8 8
## quarter_Q4 quarter_NA
## 8 0
## day_type_Week day_type_Weekend
## 8 8
## day_type_NA
## 0
metav.data <- na.omit(metav.data)
Finally, I remove any remaining columns that are not of interest.
# filter out columns
metav.data <- metav.data %>%
select(-transaction_type, -location_region, -purchase_pattern, -age_group, -quarter, -time_of_day, -day_type)
To begin the process of building our model, we start by creating our training and testing dataset.
set.seed(123)
sample_set <- sample(nrow(metav.data), round(nrow(metav.data)*.75), replace = FALSE)
mt_train <- metav.data[sample_set, ]
mt_test <- metav.data[-sample_set, ]
Looking at the proportions of our target variable anomaly, we can see that we are dealing with an imbalanced dataset: the low_risk class accounts for roughly 80% of the full, training, and testing datasets. We will have to rebalance the training data so the model can perform better on data it hasn't seen before.
round(prop.table(table(select(metav.data, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.26 80.78 10.96
round(prop.table(table(select(mt_train, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.28 80.76 10.96
round(prop.table(table(select(mt_test, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.22 80.84 10.93
To rebalance the training set, we will use the downSample function from caret. Downsampling addresses class imbalance by randomly removing samples from the majority class until it matches the size of the smallest class. Afterwards, each class in the column is evenly distributed.
set.seed(234)
mt_train <- downSample(x = mt_train[, -which(names(mt_train) == "anomaly")], y = mt_train$anomaly)
round(prop.table(table(select(mt_train, Class))),4) * 100
## Class
## high_risk low_risk moderate_risk
## 33.33 33.33 33.33
# Train the SVM model
svm_model <- svm(Class ~ ., data = mt_train, type = "C-classification", kernel = "radial")
# Predict on test data
predictions <- predict(svm_model, mt_test)
# Evaluate the model
confusionMatrix(predictions, mt_test$anomaly)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high_risk low_risk moderate_risk
## high_risk 1616 6 0
## low_risk 0 14679 9
## moderate_risk 0 1199 2139
##
## Overall Statistics
##
## Accuracy : 0.9382
## 95% CI : (0.9348, 0.9415)
## No Information Rate : 0.8084
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8331
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: high_risk Class: low_risk Class: moderate_risk
## Sensitivity 1.00000 0.9241 0.9958
## Specificity 0.99967 0.9976 0.9315
## Pos Pred Value 0.99630 0.9994 0.6408
## Neg Pred Value 1.00000 0.7571 0.9994
## Prevalence 0.08225 0.8084 0.1093
## Detection Rate 0.08225 0.7471 0.1089
## Detection Prevalence 0.08255 0.7476 0.1699
## Balanced Accuracy 0.99983 0.9609 0.9636
Overall, our model scored roughly 94% accuracy, a very good sign. The low p-value tells us that the model significantly outperforms the No Information Rate (NIR), which is the accuracy we would get by always predicting the most frequent class in the dataset.
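As a quick sanity check (a sketch using the objects created above), the NIR is simply the share of the most frequent class in the test set:
# the No Information Rate equals the largest class proportion in the test data
max(prop.table(table(mt_test$anomaly)))  # ~0.8084, matching the NIR reported above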
Looking at each class of the target variable, predictions for high_risk and low_risk were highly accurate, with positive predictive values near 99–100%. For transactions deemed moderate_risk, however, the model did not perform as well: only about 64% of its positive predictions were correct.
Another detail to note is that the model struggled when labeling transactions as anything other than low_risk: the negative predictive value for the low_risk class was only about 76%, driven largely by the 1,199 low_risk transactions misclassified as moderate_risk. Next, I prepare the data for the decision tree. Since decision trees need far less preparation, I start from a fresh copy of the original dataset (metav.data2), fix the timestamp format, derive month and day_of_week, drop the columns that are not of interest, and convert the categorical variables to factors.
# fix date values
metav.data2$timestamp <- as.POSIXct(metav.data2$timestamp, format = "%m/%d/%Y %H:%M")
metav.data2 <- metav.data2 %>%
mutate(
month = as.factor(format(timestamp, "%m")),
day_of_week = as.factor(format(timestamp, "%u"))
)
# filter out columns
metav.data2 <- metav.data2 %>%
select(-sending_address,-receiving_address,-ip_prefix,-timestamp, -risk_score) %>%
mutate(
hour_of_day = as.factor(hour_of_day),
transaction_type = as.factor(transaction_type),
location_region = as.factor(location_region),
purchase_pattern = as.factor(purchase_pattern),
age_group = as.factor(age_group),
anomaly = as.factor(anomaly)
)
metav.data2 <- na.omit(metav.data2)
The data is now clean and ready to be split into training and testing datasets. Looking at the proportions of the target variable anomaly, we again see that the training dataset needs to be rebalanced: the low_risk label makes up about 80% of the column. If left unaddressed, this imbalance makes it difficult for the model to predict well on data with different class proportions.
set.seed(411)
sample_set2 <- sample(nrow(metav.data2), round(nrow(metav.data2)*.75), replace = FALSE)
mt_train2 <- metav.data2[sample_set2, ]
mt_test2 <- metav.data2[-sample_set2, ]
round(prop.table(table(select(metav.data2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.26 80.78 10.96
round(prop.table(table(select(mt_train2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.30 80.73 10.97
round(prop.table(table(select(mt_test2, anomaly))),4) * 100
## anomaly
## high_risk low_risk moderate_risk
## 8.15 80.94 10.90
As before, we rebalance the training set with the downSample function so that each class is evenly represented.
set.seed(511)
mt_train2 <- downSample(x = mt_train2[, -which(names(mt_train2) == "anomaly")], y = mt_train2$anomaly)
round(prop.table(table(select(mt_train2, Class))),4) * 100
## Class
## high_risk low_risk moderate_risk
## 33.33 33.33 33.33
When visualizing the decision tree, the variables the model deems the most significant predictors are 'risk_score' and 'transaction_type'.
mt_mod <- rpart(Class ~ ., method = "class", data = mt_train2)
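The tree can be rendered with the rpart.plot package (a sketch of the plotting call; the figure itself is not reproduced here):
# visualize the fitted tree; requires the rpart.plot package
library(rpart.plot)
rpart.plot(mt_mod)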
Testing the decision tree on the held-out data, it scored roughly 97% accuracy. While this sounds good, a score this high warrants some caution: the tree separates the classes in this dataset very cleanly, but it may be fitting patterns specific to this data and could struggle to generalize to data it hasn't seen before.
mt_pred <- predict(mt_mod, mt_test2, type = "class")
mt_pred_table <- table(mt_test2$anomaly, mt_pred)
sum(diag(mt_pred_table)/nrow(mt_test2))
## [1] 0.9665106
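For a per-class breakdown comparable to the SVM results, caret's confusionMatrix could also be applied to the tree's predictions (a sketch; output omitted here):
# per-class metrics for the decision tree, mirroring the SVM evaluation (output not shown)
confusionMatrix(mt_pred, mt_test2$anomaly)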
The decision tree model outperformed the SVM model in terms of accuracy. Our SVM scored ~94%, while the decision tree scored ~97%.
Despite the non-normal distributions in the data, the SVM handled the high-dimensional dataset well and achieved a high accuracy score. The decision tree was simpler to set up, required less data preparation, and scored higher. For reasons of interpretability and simplicity, I believe the decision tree is the superior classification model in this case, which also aligns with the recommendations of the studies cited earlier.
In a further study, I would perform cross-validation to see which model performs best against unseen data.
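A sketch of how that cross-validation might look with caret is shown below; this is illustrative only and was not run as part of this analysis ("rpart" and "svmRadial" are caret method names, and svmRadial additionally requires the kernlab package; ctrl, cv_tree, and cv_svm are hypothetical names).
# illustrative 10-fold cross-validation setup; not run as part of this analysis
ctrl <- trainControl(method = "cv", number = 10)
cv_tree <- train(Class ~ ., data = mt_train2, method = "rpart", trControl = ctrl)
cv_svm <- train(Class ~ ., data = mt_train, method = "svmRadial", trControl = ctrl)
summary(resamples(list(tree = cv_tree, svm = cv_svm)))  # compare resampled accuracy across folds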
Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision tree ensembles to predict Coronavirus Disease 2019 infection: A comparative study. Complexity, 2021, 1–8.
Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). https://doi.org/10.1016/B978-0-12-824536-1.00014-9
Sahin, Y., & Duman, E. (2011). Detecting credit card fraud by decision trees and support vector machines. In Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 (Vol. I). IMECS 2011, March 16–18, Hong Kong.
Jain, Y., Tiwari, N., Dubey, S., & Jain, S. (2019). A comparative analysis of various credit card fraud detection techniques. International Journal of Recent Technology and Engineering (IJRTE), 7(5S2).
Sahin, Y., Bulkan, S., & Duman, E. (2013). A cost-sensitive decision tree approach for fraud detection. Expert Systems with Applications, 40, 5916–5923.
Chen, Q. (2010). Predictive modeling for non-profit fundraising. James Madison University.