Machine learning algorithms offer robust solutions for analyzing and understanding complex datasets. Among these algorithms, this project focuses on two: Decision Trees and Support Vector Machines (SVMs).
By examining both practical results and theoretical perspectives, we aim to:

- Identify which algorithm performs better for specific tasks (e.g., classification vs. regression).
- Understand the scenarios where each algorithm excels or faces challenges.
- Provide data-driven recommendations on algorithm selection.
This structured approach bridges theoretical understanding and hands-on implementation, enhancing our grasp of these essential machine learning tools.
Objective: This article reviews applications of machine learning algorithms, including SVMs and Decision Trees, to challenges posed by the COVID-19 pandemic. It examines the effectiveness of these algorithms in tasks such as classifying critical cases and allocating resources.
Key Insights:
(https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4237080)
Objective: The study aimed to predict audit opinions by applying data mining techniques, including Decision Trees and SVMs, to financial statement data. This is particularly valuable in identifying going-concern risks in companies.
Key Insights:
Conclusion: SVM with the RBF kernel emerged as the most effective algorithm in this study, outperforming Decision Trees in audit-related prediction tasks.
Objective: This article explored the use of machine learning techniques, including Decision Trees and SVMs, to predict auditor selection decisions based on company financial attributes and audit requirements.
Key Insights:
Conclusion: Decision Trees excelled in this specific use case due to their interpretability, but both algorithms effectively uncovered key patterns in auditor selection.
Objective: The study compared multiple machine learning algorithms, including Decision Trees and SVMs, in auditing financial statements to detect anomalies and predict outcomes.
Key Insights:
Conclusion: Decision Trees are advantageous for their efficiency and transparency, while SVMs excel in situations requiring higher accuracy, especially with complex and high-dimensional datasets.
The reviewed articles demonstrate that both Decision Trees and SVMs are valuable tools in auditing and risk management, each excelling in different areas:

Decision Trees:

- Highly interpretable and efficient.
- Preferred when transparency and quick classification are essential.

SVMs:

- Superior for handling high-dimensional data and achieving higher accuracy.
- Ideal for tasks requiring precise classification, such as predicting audit risks.

The choice between these algorithms depends on:

- Dataset Characteristics: Imbalance, dimensionality, and size.
- Task Objectives: The need for accuracy vs. interpretability.
- Application Context: Whether results need to be explainable or optimized for performance.
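To make these trade-offs concrete, the short sketch below fits both classifiers and compares their test accuracy. It is our illustration, not part of the reviewed studies: it uses R's built-in iris data as a stand-in, and the package choices (rpart for the tree, e1071 for the SVM) are assumptions.

library(rpart)  # Decision Tree
library(e1071)  # SVM

set.seed(42)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test <- iris[-idx, ]

# Interpretable model: print(tree_fit) shows the fitted tree as readable rules
tree_fit <- rpart(Species ~ ., data = train, method = "class")
tree_acc <- mean(predict(tree_fit, test, type = "class") == test$Species)

# Higher-capacity model: RBF-kernel SVM, less transparent but often more accurate
svm_fit <- svm(Species ~ ., data = train, kernel = "radial")
svm_acc <- mean(predict(svm_fit, test) == test$Species)

round(c(decision_tree = tree_acc, svm = svm_acc), 3)

On a small, clean dataset like this the two usually score similarly; the differences above matter most on high-dimensional or imbalanced data.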
library(readr)
# Import the dataset from Homework 2
merged_data <- read_csv("merged_data.csv")
## Rows: 6588 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, Risk, IncomeGroup
## dbl (4): MMR, Year, MMR_Change, PWRPC
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Rename the dataset
country_maternal_mortality_ratio <- merged_data # merged_data is the dataset from Homework 2
# Export the dataset to a CSV file
write.csv(country_maternal_mortality_ratio, "country_maternal_mortality_ratio.csv", row.names = FALSE)
# Confirm export location
#getwd()
After running the SVM for the first time, we got this error:
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
The error occurs because svm() requires the dependent variable (Risk) to be a factor for classification or numeric for regression. Since we are performing classification (classifying countries as “High Risk” or “Low Risk”), SVM expects the target variable to be a factor, and non-numeric independent variables (like IncomeGroup) must be encoded correctly.
We resolved the issue and proceeded as follows. First, we make sure the Risk variable is encoded as a factor:
# Convert the target variable 'Risk' to a factor
country_maternal_mortality_ratio$Risk <- as.factor(country_maternal_mortality_ratio$Risk)
IncomeGroup is a categorical variable (e.g., “Low income”, “Middle income”) and needs to be converted into numeric format using one-hot encoding or dummy variables. We’ll use one-hot encoding to handle this:
# Load the necessary library
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
# Perform one-hot encoding for 'IncomeGroup'
encoded_data <- dummy_cols(
country_maternal_mortality_ratio,
select_columns = "IncomeGroup",
remove_first_dummy = TRUE, # To avoid multicollinearity
remove_selected_columns = TRUE # To remove the original 'IncomeGroup' column
)
# Inspect the structure of the data after encoding
str(encoded_data)
## tibble [6,588 × 9] (S3: tbl_df/tbl/data.frame)
## $ Country : chr [1:6588] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ MMR : num [1:6588] 1910 1603 1587 1414 1383 ...
## $ Year : num [1:6588] 1985 1986 1987 1988 1989 ...
## $ Risk : Factor w/ 2 levels "High Risk","Low Risk": 1 1 1 1 1 1 1 1 1 1 ...
## $ MMR_Change : num [1:6588] 0 -307.4 -16.2 -172.8 -31.2 ...
## $ PWRPC : num [1:6588] 48.7 48.7 48.7 48.7 48.7 ...
## $ IncomeGroup_Low income : int [1:6588] 1 1 1 1 1 1 1 1 1 1 ...
## $ IncomeGroup_Lower middle income: int [1:6588] 0 0 0 0 0 0 0 0 0 0 ...
## $ IncomeGroup_Upper middle income: int [1:6588] 0 0 0 0 0 0 0 0 0 0 ...
Now that the data is properly encoded, we re-run the SVM model:
# Load necessary libraries
library(e1071)
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Scale numeric features (MMR, Year, MMR_Change, PWRPC)
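# (RBF-kernel SVMs compare observations by distance, so unscaled features with large ranges would dominate the kernel)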
scaled_data <- encoded_data %>%
mutate(across(c(MMR, Year, MMR_Change, PWRPC), scale))
# Train-test split
set.seed(123)
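# createDataPartition() stratifies on Risk, keeping class proportions similar in train and test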
train_index <- createDataPartition(scaled_data$Risk, p = 0.8, list = FALSE)
train_data <- scaled_data[train_index, ]
test_data <- scaled_data[-train_index, ]
# Train the SVM model
svm_model <- svm(
Risk ~ MMR + Year + MMR_Change + PWRPC + `IncomeGroup_Low income` + `IncomeGroup_Lower middle income` + `IncomeGroup_Upper middle income`,
data = train_data,
kernel = "radial",
cost = 1,
gamma = 0.1,
probability = TRUE
)
# Make predictions on the test dataset
predictions <- predict(svm_model, test_data, probability = TRUE)
# Evaluate the model
confusion_matrix <- confusionMatrix(predictions, test_data$Risk)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Risk Low Risk
## High Risk 351 4
## Low Risk 4 958
##
## Accuracy : 0.9939
## 95% CI : (0.9881, 0.9974)
## No Information Rate : 0.7304
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9846
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9887
## Specificity : 0.9958
## Pos Pred Value : 0.9887
## Neg Pred Value : 0.9958
## Prevalence : 0.2696
## Detection Rate : 0.2665
## Detection Prevalence : 0.2696
## Balanced Accuracy : 0.9923
##
## 'Positive' Class : High Risk
##
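The cost and gamma values used above (1 and 0.1) were fixed by hand. As a hedged sketch, they could instead be chosen by grid search with cross-validation using e1071's tune.svm() on the same train_data; the grid values below are our assumptions.

set.seed(123)
# Grid-search cost and gamma; tune.svm() uses 10-fold cross-validation by default
tuned <- tune.svm(
  Risk ~ .,
  data = train_data[, setdiff(names(train_data), "Country")],  # drop the Country identifier
  kernel = "radial",
  cost = c(0.1, 1, 10),
  gamma = c(0.01, 0.1, 1)
)
summary(tuned)                # cross-validated error for each (cost, gamma) pair
best_svm <- tuned$best.model  # model refit on the full training set with the best pair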
This section compares the results of two machine learning models, Random Forest (Homework 2) and Support Vector Machine (SVM, Homework 3), applied to classify countries as “High Risk” or “Low Risk” based on maternal mortality ratio (MMR) thresholds. The objective is to assess the strengths and weaknesses of each model in terms of predictive performance.
The table below summarizes the key performance metrics for both models:
| Metric | Random Forest (HW2) | SVM (HW3) |
|---|---|---|
| Accuracy | 1.0 | 0.9939 |
| Sensitivity | 1.0 | 0.9887 |
| Specificity | 1.0 | 0.9958 |
| F1-Score | 1.0 | 0.9887 |
| Kappa | 1.0 | 0.9846 |
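caret's default confusionMatrix() output does not print F1 directly; the SVM's F1 in the table can be verified from the confusion-matrix counts (or by calling confusionMatrix() with mode = "everything"):

# F1 for the positive class ("High Risk"), using the counts printed above
TP <- 351; FP <- 4; FN <- 4
precision <- TP / (TP + FP)  # 0.9887
recall <- TP / (TP + FN)     # 0.9887 (equals sensitivity)
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 4)                 # 0.9887, matching the table
# Alternatively: confusionMatrix(predictions, test_data$Risk, mode = "everything")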
Random Forest achieved perfect scores (1.0) across every metric, suggesting it fits this specific dataset extremely well; however, perfect performance can be a sign of overfitting. SVM achieved slightly lower accuracy (0.9939) on the held-out test set, which suggests better generalization and makes it more reliable for unseen data.
Both models are suitable for classification tasks, as demonstrated in this analysis. Both also extend beyond classification: SVM handles regression through support vector regression (SVR), and Random Forest likewise supports regression on structured data, as the sketch below illustrates for SVM.
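As an illustration of the regression point, here is a minimal SVR sketch with e1071, predicting the continuous MMR value instead of the Risk class; the predictor choice is an assumption made for illustration only.

# A numeric response makes svm() perform eps-regression (SVR) by default
svr_model <- svm(MMR ~ Year + PWRPC, data = encoded_data, kernel = "radial")
# In-sample RMSE as a quick fit check; a real analysis would reuse the train/test split above
sqrt(mean((predict(svr_model, encoded_data) - encoded_data$MMR)^2))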
Based on the results, the recommendations align well with the observed strengths and weaknesses of the models:
- Random Forest is ideal for smaller datasets where interpretability and accuracy are paramount.
- SVM is preferable for larger datasets and situations requiring better generalization.
I agree with these recommendations, as SVM’s performance and generalization capabilities make it a strong choice for robust real-world applications, whereas Random Forest offers transparency and high accuracy for simpler problems.