Introduction

Machine learning algorithms offer robust solutions for analyzing and understanding complex datasets. Among these algorithms:

  • Decision Trees: Known for their simplicity and interpretability, they create models that are easy to understand and explain.
  • Support Vector Machines (SVMs): Renowned for handling high-dimensional data and capturing non-linear relationships effectively.

Project Goals

This project focuses on:

  • Exploring academic insights: Reviewing articles comparing SVMs and Decision Trees to understand their relative strengths and use cases.
  • Analyzing the dataset from Homework #2 using the SVM algorithm.
  • Comparing SVM performance with the results from the prior analysis (a Random Forest model, itself an ensemble of Decision Trees).

By examining both practical results and theoretical perspectives, we aim to:

  • Identify which algorithm performs better for specific tasks (e.g., classification vs. regression).

  • Understand the scenarios where each algorithm excels or faces challenges.

  • Provide data-driven recommendations on algorithm selection.

This structured approach bridges theoretical understanding and hands-on implementation, enhancing our grasp of these essential machine learning tools.


1. Literature Review (Articles)

1.1 Summaries of the Two Provided Articles

1.1.1 Article 1: Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection

  • Objective: This article evaluates the efficacy of decision tree ensembles in predicting COVID-19 infections based on laboratory test data. The study specifically focuses on datasets with imbalanced class distributions, a common challenge in medical diagnostics.

  • Key Insights:

    • Decision tree ensembles were found to outperform standard methods by handling imbalanced datasets effectively.
    • Sampling techniques, such as SMOTE (Synthetic Minority Oversampling Technique) and RUS (Random Undersampling), were used to mitigate class imbalance, which significantly improved the model’s predictive accuracy (an illustrative R sketch of this balancing step follows this list).
    • Performance metrics included:
      • F-measure: To balance precision and recall.
      • Precision and Recall: To gauge the reliability of positive predictions (precision) and the model’s ability to detect actual positives (recall).
      • AUROC: To evaluate overall classification performance.
    • The results emphasized the robustness of decision tree ensembles, especially when combined with effective sampling methods to address imbalanced datasets. These findings are particularly relevant in healthcare, where class imbalances are often critical (e.g., identifying rare cases).
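
As a concrete illustration of the sampling step mentioned above, the sketch below balances a toy imbalanced data set with caret’s downSample() and upSample() helpers. This is only a hedged illustration: the article itself applied SMOTE and RUS to laboratory data, SMOTE would require an additional package (e.g., smotefamily), and the toy variable names here are invented for the example.

# Illustrative only: balance a toy imbalanced data set
library(caret)

set.seed(123)
toy <- data.frame(
  lab_value_1 = rnorm(1000),
  lab_value_2 = rnorm(1000),
  infected    = factor(c(rep("Positive", 100), rep("Negative", 900)))
)
table(toy$infected)   # 100 Positive vs 900 Negative: imbalanced

# Random undersampling (RUS-style): shrink the majority class
rus_data <- downSample(x = toy[, c("lab_value_1", "lab_value_2")],
                       y = toy$infected, yname = "infected")
table(rus_data$infected)   # 100 vs 100

# Random oversampling: replicate minority-class rows with replacement
ros_data <- upSample(x = toy[, c("lab_value_1", "lab_value_2")],
                     y = toy$infected, yname = "infected")
table(ros_data$infected)   # 900 vs 900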

1.1.2 Article 2: Machine Learning Applications in the COVID-19 Pandemic

  • Objective: This article reviews machine learning applications, including SVM and Decision Trees, in addressing various challenges posed by the COVID-19 pandemic. It examines the effectiveness of these algorithms in tasks like classification of critical cases and resource allocation.

  • Key Insights:

    • SVM demonstrated high effectiveness in classification tasks, particularly in identifying patients requiring ICU admission. Its ability to handle non-linear decision boundaries and high-dimensional data makes it suitable for these scenarios.
    • Decision Trees, while slightly less accurate in some cases, are favored for their interpretability. They are often combined into ensemble methods, such as Random Forests and XGBoost, to enhance performance.
    • The study found that SVM outperformed Decision Trees in scenarios requiring precision, but Decision Tree ensembles excelled in handling imbalanced datasets, providing a practical advantage in healthcare data.
    • The findings highlight the complementary strengths of the two algorithms, with SVM being highly accurate and Decision Trees offering transparency and adaptability.

1.2 Additional Articles Comparing Decision Trees and SVMs in Auditing and Risk Management

1.2.1 Article 1: Audit Opinion Prediction: A Comparison of Data Mining Techniques

(https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4237080)

  • Objective: The study aimed to predict audit opinions by applying data mining techniques, including Decision Trees and SVMs, to financial statement data. This is particularly valuable in identifying going-concern risks in companies.

  • Key Insights:

    • SVM models with the Radial Basis Function (RBF) kernel achieved the highest accuracy and minimized both Type I and Type II errors, making them highly reliable for audit opinion prediction (the kernel’s form is shown after this list).
    • The models demonstrated excellent performance in predicting going-concern modifications, with accuracy rates ranging from 84.2% to 100%.
    • Decision Trees provided interpretable results but were slightly less accurate compared to SVM.
  • Conclusion: SVM with the RBF kernel emerged as the most effective algorithm in this study, outperforming Decision Trees in audit-related prediction tasks.
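
For reference, the RBF kernel referred to above (and used again in Section 2.3.3) has the standard form below; this formula is general background rather than something quoted from the article:

K(x_i, x_j) = \exp\left(-\gamma \,\lVert x_i - x_j \rVert^{2}\right), \qquad \gamma > 0

Here gamma controls how quickly the similarity between two observations decays with distance: larger values give more flexible (and potentially overfit) decision boundaries, smaller values give smoother ones.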

1.2.2 Article 2: Support Vector Machines, Decision Trees, and Neural Networks for Auditor Selection

(https://www.academia.edu/714333/Support_vector_machines_Decision_Trees_and_Neural_Networks_for_auditor_selection)

  • Objective: This article explored the use of machine learning techniques, including Decision Trees and SVMs, to predict auditor selection decisions based on company financial attributes and audit requirements.

  • Key Insights:

    • Decision Tree models slightly outperformed SVMs, achieving accuracy rates between 75.4% and 84%.
    • Both algorithms identified significant factors influencing auditor selection, with debt level being one of the most important predictors.
    • While SVM provided robust predictions, Decision Trees were particularly valuable for their ability to explain decision-making processes.
  • Conclusion: Decision Trees excelled in this specific use case due to their interpretability, but both algorithms effectively uncovered key patterns in auditor selection.

1.2.3 Article 3: Financial Statement Audit Utilizing Naive Bayes Networks, Decision Trees, Linear Discriminant Analysis, and Logistic Regression

(https://www.ijraset.com/best-journal/comparative-analysis-of-machine-learning-algorithms-knn-svm-decision-tree-and-logistic-regression-for-efficiency-and-performance)

  • Objective: The study compared multiple machine learning algorithms, including Decision Trees and SVMs, in auditing financial statements to detect anomalies and predict outcomes.

  • Key Insights:

    • Decision Trees were highlighted for their ability to generate interpretable rules, making them useful for explaining classification results to stakeholders.
    • SVMs were more robust in handling high-dimensional datasets, offering better accuracy in complex scenarios.
    • Decision Trees had shorter classification times, making them more efficient for quick decision-making tasks.
  • Conclusion: Decision Trees are advantageous for their efficiency and transparency, while SVMs excel in situations requiring higher accuracy, especially with complex and high-dimensional datasets.

1.3 Conclusion

The reviewed articles demonstrate that both Decision Trees and SVMs are valuable tools in auditing and risk management, each excelling in different areas:

  • Decision Trees:

    • Highly interpretable and efficient.

    • Preferred when transparency and quick classification are essential.

  • SVMs:

    • Superior for handling high-dimensional data and achieving higher accuracy.

    • Ideal for tasks requiring precise classification, such as predicting audit risks.

The choice between these algorithms depends on:

  • Dataset Characteristics: Imbalance, dimensionality, and size.

  • Task Objectives: The need for accuracy vs. interpretability.

  • Application Context: Whether results need to be explainable or optimized for performance.

2. Algorithm Implementation

2.1 Export merged_data for SVM Analysis (merged_data is the dataset from Homework 2)

library(readr)

# Import the dataset from Homework 2
merged_data <- read_csv("merged_data.csv")
## Rows: 6588 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, Risk, IncomeGroup
## dbl (4): MMR, Year, MMR_Change, PWRPC
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Rename the dataset
country_maternal_mortality_ratio <- merged_data # merged_data is the dataset from Homework 2

# Export the dataset to a CSV file
write.csv(country_maternal_mortality_ratio, "country_maternal_mortality_ratio.csv", row.names = FALSE)

# Confirm export location
#getwd()

2.3 Apply the SVM Algorithm

When we first ran svm(), we got this error:

Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : 
  Need numeric dependent variable for regression.

The error occurs because Risk was imported as a character column, so svm() defaulted to regression and required a numeric dependent variable. Since we are performing classification (classifying countries as “High Risk” or “Low Risk”), SVM expects the target variable to be a factor, and non-numeric independent variables (like IncomeGroup) must be encoded correctly.
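
A quick type check makes the cause concrete (Risk was imported as a character column, per the read_csv column specification shown in Section 2.1):

# Confirm the current type of the target variable
class(country_maternal_mortality_ratio$Risk)
# expected: "character", since read_csv listed Risk among the chr columns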

We resolved the issue and proceeded as follows:

2.3.1 Ensure Target Variable (Risk) is a Factor

First, we make sure the Risk variable is encoded as a factor:

# Convert the target variable 'Risk' to a factor
country_maternal_mortality_ratio$Risk <- as.factor(country_maternal_mortality_ratio$Risk)
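
As an optional sanity check (assuming the same column names), the levels and class distribution can be inspected after the conversion:

# Verify the conversion and inspect the class balance
levels(country_maternal_mortality_ratio$Risk)
table(country_maternal_mortality_ratio$Risk)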

2.3.2 Encode Categorical Variables

IncomeGroup is a categorical variable (e.g., “Low income”, “Upper middle income”) and needs to be converted into numeric format using one-hot encoding or dummy variables. We’ll use one-hot encoding to handle this:

# Load the necessary library
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
# Perform one-hot encoding for 'IncomeGroup'
encoded_data <- dummy_cols(
  country_maternal_mortality_ratio,
  select_columns = "IncomeGroup",
  remove_first_dummy = TRUE,  # To avoid multicollinearity
  remove_selected_columns = TRUE  # To remove the original 'IncomeGroup' column
)

# Inspect the structure of the data after encoding
str(encoded_data)
## tibble [6,588 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Country                        : chr [1:6588] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ MMR                            : num [1:6588] 1910 1603 1587 1414 1383 ...
##  $ Year                           : num [1:6588] 1985 1986 1987 1988 1989 ...
##  $ Risk                           : Factor w/ 2 levels "High Risk","Low Risk": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MMR_Change                     : num [1:6588] 0 -307.4 -16.2 -172.8 -31.2 ...
##  $ PWRPC                          : num [1:6588] 48.7 48.7 48.7 48.7 48.7 ...
##  $ IncomeGroup_Low income         : int [1:6588] 1 1 1 1 1 1 1 1 1 1 ...
##  $ IncomeGroup_Lower middle income: int [1:6588] 0 0 0 0 0 0 0 0 0 0 ...
##  $ IncomeGroup_Upper middle income: int [1:6588] 0 0 0 0 0 0 0 0 0 0 ...

2.3.3 Re-Run SVM Algorithm

Now that the data is properly encoded, we re-run the SVM model:

# Load necessary libraries
library(e1071)
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Scale numeric features (MMR, Year, MMR_Change, PWRPC)
scaled_data <- encoded_data %>%
  mutate(across(c(MMR, Year, MMR_Change, PWRPC), scale))

# Train-test split
set.seed(123)
train_index <- createDataPartition(scaled_data$Risk, p = 0.8, list = FALSE)
train_data <- scaled_data[train_index, ]
test_data <- scaled_data[-train_index, ]

# Train the SVM model
svm_model <- svm(
  Risk ~ MMR + Year + MMR_Change + PWRPC + `IncomeGroup_Low income` + `IncomeGroup_Lower middle income` + `IncomeGroup_Upper middle income`,
  data = train_data,
  kernel = "radial",
  cost = 1,
  gamma = 0.1,
  probability = TRUE
)

# Make predictions on the test dataset
predictions <- predict(svm_model, test_data, probability = TRUE)

# Evaluate the model
confusion_matrix <- confusionMatrix(predictions, test_data$Risk)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  High Risk Low Risk
##   High Risk       351        4
##   Low Risk          4      958
##                                           
##                Accuracy : 0.9939          
##                  95% CI : (0.9881, 0.9974)
##     No Information Rate : 0.7304          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9846          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9887          
##             Specificity : 0.9958          
##          Pos Pred Value : 0.9887          
##          Neg Pred Value : 0.9958          
##              Prevalence : 0.2696          
##          Detection Rate : 0.2665          
##    Detection Prevalence : 0.2696          
##       Balanced Accuracy : 0.9923          
##                                           
##        'Positive' Class : High Risk       
## 
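
The hyperparameters above (cost = 1, gamma = 0.1) were fixed by hand. As an optional follow-up, not part of the original analysis, a cross-validated grid search could be sketched with e1071’s tune() helper on the same training data; note that it can be slow, since the model is refit for every (cost, gamma) pair:

# Optional: 10-fold cross-validated grid search over cost and gamma
set.seed(123)
svm_tuned <- tune(
  svm,
  Risk ~ MMR + Year + MMR_Change + PWRPC +
    `IncomeGroup_Low income` + `IncomeGroup_Lower middle income` +
    `IncomeGroup_Upper middle income`,
  data = train_data,
  kernel = "radial",
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1))
)

summary(svm_tuned)         # cross-validated error for each parameter pair
svm_tuned$best.parameters  # combination with the lowest CV error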

3. Model Comparison: Random Forest (Homework 2) vs SVM (Homework 3)

This section compares the results of two machine learning models, Random Forest (Homework 2) and Support Vector Machine (SVM, Homework 3), applied to classify countries as “High Risk” or “Low Risk” based on maternal mortality ratio (MMR) thresholds. The objective is to assess the strengths and weaknesses of each model in terms of predictive performance.

3.1 Performance Metrics

The table below summarizes the key performance metrics for both models:

Metric        Random Forest (HW2)   SVM (HW3)
Accuracy      1.0                   0.9939
Sensitivity   1.0                   0.9887
Specificity   1.0                   0.9958
F1-Score      1.0                   0.9887
Kappa         1.0                   0.9846
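
The SVM column in this table can be read directly off the confusionMatrix object from Section 2.3.3; caret stores the overall and per-class statistics (including, in recent caret versions, F1) in the overall and byClass elements. A short extraction sketch, assuming the same confusion_matrix object:

# Reproduce the SVM metrics from the caret confusionMatrix object
confusion_matrix$overall[c("Accuracy", "Kappa")]
confusion_matrix$byClass[c("Sensitivity", "Specificity", "F1")]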

3.2 Observations and Analysis

3.2.1 Accuracy

  • Random Forest achieved perfect accuracy (1.0), whereas SVM achieved 99.39%.
  • The slight drop in SVM accuracy could indicate better generalization, as perfect accuracy might suggest overfitting in the Random Forest model.

3.2.2 Sensitivity and Specificity

  • Random Forest again showed perfect sensitivity (1.0) and specificity (1.0), meaning it identified all true positives and negatives correctly.
  • SVM, while slightly lower, achieved high sensitivity (0.9887) and specificity (0.9958), demonstrating robust performance in identifying both “High Risk” and “Low Risk” cases.

3.2.3 F1-Score

  • Random Forest outperformed SVM with a perfect F1-Score of 1.0, compared to SVM’s 0.9887. Both models therefore maintain a strong balance between precision and recall, with Random Forest achieving it perfectly on this dataset.

3.2.4 Kappa

  • Both models showed near-perfect agreement with the ground truth, with Random Forest achieving a Kappa value of 1.0 and SVM achieving 0.9846. The slight drop in SVM Kappa suggests minor misclassification in edge cases.

3.3 Conclusion and Recommendations

3.3.2 Is it Better for Classification or Regression Scenarios?

  • Both models are suitable for classification tasks, as demonstrated in this analysis.

  • SVM, however, extends naturally to regression problems through support vector regression (SVR), whereas Random Forest, although it also supports regression, was applied here only to classification on structured data (a minimal SVR sketch follows this list).
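
To make the regression point concrete, the sketch below reuses encoded_data and the existing train/test split to fit a support vector regression (SVR) model on the numeric MMR column. This is purely illustrative and was not part of the graded analysis; with a numeric response, e1071’s svm() defaults to eps-regression and scales the inputs internally:

# Illustrative SVR: predict the MMR value instead of the Risk class
svr_train <- encoded_data[train_index, ]
svr_test  <- encoded_data[-train_index, ]

svr_model <- svm(
  MMR ~ Year + MMR_Change + PWRPC,   # numeric response => eps-regression
  data = svr_train,
  kernel = "radial"
)

svr_pred <- predict(svr_model, svr_test)
sqrt(mean((svr_pred - svr_test$MMR)^2))   # RMSE on the held-out rows, in MMR units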

3.3.3 Do You Agree with the Recommendations? Why?

  • Based on the results, the recommendations align well with the observed strengths and weaknesses of the models:

    • Random Forest is ideal for smaller, structured datasets where high accuracy is paramount and variable-importance measures still offer some interpretability.

    • SVM is preferable for larger datasets and situations requiring better generalization.

  • I agree with these recommendations, as SVM’s performance and generalization capabilities make it a strong choice for robust real-world applications, whereas Random Forest offers transparency and high accuracy for simpler problems.