Data source: https://www.kaggle.com/competitions/titanic/data?select=test.csv
In the prior Titanic survival prediction analysis, we relied on decision trees for modeling. Here, we assess the performance of a random forest (an ensemble of decision trees) and an SVM model on the same dataset.
Similar to before, missing values in “Embarked” were replaced with the most common value, “S”. The single missing fare in the test dataset was filled with the mean fare. Handling missing ages required a more involved approach: leveraging passenger names, we inferred ages from titles (“Miss”, “Mrs”, etc.) and filled missing age values with the mean age for each title group. To simplify the analysis, irrelevant variables such as “Cabin”, “Ticket”, and “PassengerId” were dropped because of missing values or redundant information, and the categorical variables “Sex” and “Embarked” were converted to numeric codes to aid model construction.
To compare the decision tree (random forest) and SVM models, a confusion matrix was computed for each to determine accuracy, precision, recall, and F1 score; note that caret treats 0 (“did not survive”) as the positive class here. The decision tree model exhibited an accuracy of 0.7895, meaning approximately 78.95% of its predictions were correct, while the SVM model achieved a slightly lower accuracy of 0.7632. Precision, the proportion of predicted positives that were truly positive, was higher for the decision tree model at 0.8034 than for the SVM model at 0.7821. The decision tree model also had the higher recall, 0.8720 versus 0.8537, correctly identifying about 87.20% of the passengers in the positive class. Both models achieved strong F1 scores, 0.8363 for the decision tree model and 0.8163 for the SVM model, indicating a good balance between precision and recall. Additionally, the decision tree model had a higher AUC of 0.8523 than the SVM model’s 0.8189, indicating better discriminative ability.
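To make these definitions concrete, the decision tree figures above can be reproduced by hand from its confusion matrix counts reported later in this document (positive class = 0); this is a small illustrative sketch:
# Counts from the decision tree confusion matrix (positive class = 0)
tp <- 143 # predicted 0, actually 0
fp <- 35  # predicted 0, actually 1
fn <- 21  # predicted 1, actually 0
tn <- 67  # predicted 1, actually 1
accuracy <- (tp + tn) / (tp + fp + fn + tn)         # 0.7895
precision <- tp / (tp + fp)                         # 0.8034
recall <- tp / (tp + fn)                            # 0.8720
f1 <- 2 * precision * recall / (precision + recall) # 0.8363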
The provided articles cover the application of decision trees and SVMs to COVID-19 prediction. The first article investigates the efficacy of different decision tree models, including single decision trees, random forests, and ensemble methods, in predicting COVID-19, and reports varying levels of accuracy, precision, recall, and F1-score for each model. The second article focuses solely on SVMs for COVID-19 prediction, achieving an accuracy of 87%, with precision, recall, and F1-score reported for different infection classes.
Among the other articles found, one compares SVMs and decision trees for text classification, concluding that SVMs generally outperform decision trees across multiple metrics owing to their ability to handle high-dimensional data and non-linear class boundaries. The second article provides a comprehensive view of decision trees and SVMs in machine learning, emphasizing their versatility in both classification and regression tasks. The third article delves into ensemble learning with SVMs and decision trees, highlighting how combining multiple models can enhance predictive performance.
Overall, these articles contribute valuable insights into the strengths and applications of decision trees and SVMs in various domains, shedding light on their performance and versatility in solving diverse machine-learning problems.
Decision trees offer simplicity and interpretability, making them well suited to exploratory tasks and datasets with mixed variable types. They handle outliers and missing data well but may overfit complex datasets. Support Vector Machines (SVMs), by contrast, excel at capturing complex relationships in high-dimensional spaces, making them suitable for datasets with many features and non-linear class boundaries (a sketch of a non-linear kernel appears after the SVM fit in the code below). SVMs prioritize predictive accuracy but are less interpretable than decision trees. Given that the Titanic dataset is small and does not demand an overly complex model or extensive computational power, and considering that the decision tree (random forest) model had higher accuracy, precision, recall, and F1-score, it is likely the better choice for this data.
The analysis compared the performance of a random forest model and an SVM model in predicting Titanic survival. We followed a systematic approach to handling missing data, transforming variables, and preparing the dataset for modeling.
The comparison between the decision tree (random forest) and SVM models revealed nuanced differences in their performance metrics. While both models achieved high accuracy, precision, recall, and F1 scores, the decision tree model performed slightly better on most metrics. Notably, it had higher recall and AUC, indicating that it correctly classified a larger share of the positive class and had better discriminative ability than the SVM model.
Considering the simplicity, interpretability, and competitive performance of decision trees observed in our analysis, we conclude that the decision tree model is well-suited for the Titanic survival prediction task. Its ability to handle the dataset effectively without the need for extensive computational resources makes it a practical choice for this scenario. Overall, our analysis highlights the versatility and effectiveness of decision trees in predictive modeling tasks like Titanic survival prediction.
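The full R code follows. It assumes the necessary packages are loaded and that the Kaggle CSVs have been read into titanic_train and titanic_test; a minimal setup sketch (the file names are assumptions about where the data was saved):
# Packages used throughout the analysis
library(dplyr)        # mutate, select, pipes
library(tidyr)        # replace_na
library(stringr)      # str_extract
library(caret)        # createDataPartition, confusionMatrix
library(randomForest) # random forest model
library(e1071)        # svm
library(pROC)         # roc, auc
# Hypothetical file paths; point these at the downloaded Kaggle CSVs
titanic_train <- read.csv("train.csv")
titanic_test <- read.csv("test.csv")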
# Convert integer columns to factors and recode empty strings as NA
titanic_train <- titanic_train %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~ factor(replace(., . == "", NA)))
titanic_test <- titanic_test %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~ factor(replace(., . == "", NA)))
# Fill missing Embarked values with the most common port, "S"
titanic_train$Embarked <- replace_na(titanic_train$Embarked, "S")
Because only one value is missing from the Fare column in the test data, we impute it with the mean fare.
titanic_test <- titanic_test %>%
  mutate(Fare = if_else(is.na(Fare), mean(Fare, na.rm = TRUE), Fare))
We decided to drop the Cabin column because it has too many missing values, and to the extent it tells us anything, it mostly reflects financial class, which Pclass already captures.
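As a quick check (a sketch, not part of the original output), the share of missing Cabin entries can be inspected, since blank strings were recoded to NA above:
# Proportion of passengers with no recorded cabin
mean(is.na(titanic_train$Cabin))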
How should we determine Age, though? We could impute it with the overall mean, but there may be a better way: the titles in passengers’ names suggest approximate age groups.
# Extract the title (e.g., "Mr.", "Miss.") from the 'Name' column into a new 'Initial' column
titanic_train$Initial <- str_extract(titanic_train$Name, "[A-Za-z]+\\.")
# If the regex failed for any name, label the title as "Unknown"
titanic_train$Initial[is.na(titanic_train$Initial)] <- "Unknown"
# Collapse equivalent and rare titles into five groups
titanic_train <- titanic_train %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Mlle.", "Mme.", "Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Countess.", "Mrs.", "Lady.") ~ "Mrs.",
    Initial %in% c("Don.", "Sir.", "Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))
# Fill missing ages with the mean age of the passenger's title group
titanic_train <- titanic_train %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
# Extract titles from the test-set names in the same way
titanic_test$Initial <- str_extract(titanic_test$Name, "[A-Za-z]+\\.")
# If the regex failed for any name, label the title as "Unknown"
titanic_test$Initial[is.na(titanic_test$Initial)] <- "Unknown"
# Collapse the test-set titles into the same five groups
titanic_test <- titanic_test %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Dona.", "Mrs.") ~ "Mrs.",
    Initial %in% c("Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))
# Fill missing test-set ages with the per-title mean age
titanic_test <- titanic_test %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
# Drop unneeded columns and encode Sex and Embarked numerically
titanic_train <- titanic_train %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  )) %>%
  mutate_if(is.numeric, as.numeric)
titanic_test <- titanic_test %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  )) %>%
  mutate_if(is.numeric, as.numeric)
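As a sanity check (a sketch, not in the original output), one can confirm that no missing values remain in the modeling columns after these steps:
# Count remaining NA values per column in both sets
colSums(is.na(titanic_train))
colSums(is.na(titanic_test))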
set.seed(123)
# Split the training data 70/30 into modeling and evaluation sets
train_index <- createDataPartition(titanic_train$Survived, p = 0.7, list = FALSE)
train_data <- titanic_train[train_index, ]
test_data <- titanic_train[-train_index, ]
set.seed(123)
# Random forest model (an ensemble of decision trees; referred to as the "decision tree
# model" in the text). Survived is a factor, so randomForest performs classification.
decision_tree_model <- randomForest(Survived ~ ., data = train_data)
set.seed(123)
# SVM model with probability estimates
svm_model <- svm(Survived ~ ., data = train_data, kernel = "linear", probability = TRUE)
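As noted in the earlier discussion of non-linear class boundaries, a radial (RBF) kernel could also be tried for comparison; this is a sketch not included in the original analysis, and svm_model_rbf is a hypothetical name:
# Hypothetical alternative: non-linear radial-basis kernel, for comparison with the linear fit
svm_model_rbf <- svm(Survived ~ ., data = train_data, kernel = "radial", probability = TRUE)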
# Predictions for decision tree model
decision_tree_predictions <- predict(decision_tree_model, test_data, type = "class")
# Predictions for SVM model
svm_predictions <- predict(svm_model, test_data)
# Confusion matrix for decision tree model
decision_tree_conf_matrix <- confusionMatrix(decision_tree_predictions, test_data$Survived)
# Confusion matrix for SVM model
svm_conf_matrix <- confusionMatrix(svm_predictions, test_data$Survived)
# Print the confusion matrices
print("Confusion Matrix for Decision Tree Model:")
## [1] "Confusion Matrix for Decision Tree Model:"
print(decision_tree_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 143 35
## 1 21 67
##
## Accuracy : 0.7895
## 95% CI : (0.7355, 0.8369)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 1.118e-09
##
## Kappa : 0.5429
##
## Mcnemar's Test P-Value : 0.08235
##
## Sensitivity : 0.8720
## Specificity : 0.6569
## Pos Pred Value : 0.8034
## Neg Pred Value : 0.7614
## Prevalence : 0.6165
## Detection Rate : 0.5376
## Detection Prevalence : 0.6692
## Balanced Accuracy : 0.7644
##
## 'Positive' Class : 0
##
print("Confusion Matrix for SVM Model:")
## [1] "Confusion Matrix for SVM Model:"
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 140 39
## 1 24 63
##
## Accuracy : 0.7632
## 95% CI : (0.7074, 0.8129)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : 2.657e-07
##
## Kappa : 0.4848
##
## Mcnemar's Test P-Value : 0.07776
##
## Sensitivity : 0.8537
## Specificity : 0.6176
## Pos Pred Value : 0.7821
## Neg Pred Value : 0.7241
## Prevalence : 0.6165
## Detection Rate : 0.5263
## Detection Prevalence : 0.6729
## Balanced Accuracy : 0.7357
##
## 'Positive' Class : 0
##
# Calculate accuracy, precision, recall, and F1 score for decision tree model
decision_tree_accuracy <- decision_tree_conf_matrix$overall['Accuracy']
decision_tree_precision <- decision_tree_conf_matrix$byClass['Precision']
decision_tree_recall <- decision_tree_conf_matrix$byClass['Recall']
decision_tree_f1_score <- decision_tree_conf_matrix$byClass['F1']
# Calculate accuracy, precision, recall, and F1 score for SVM model
svm_accuracy <- svm_conf_matrix$overall['Accuracy']
svm_precision <- svm_conf_matrix$byClass['Precision']
svm_recall <- svm_conf_matrix$byClass['Recall']
svm_f1_score <- svm_conf_matrix$byClass['F1']
cat("Decision Tree Model Metrics:\n")
## Decision Tree Model Metrics:
cat("Accuracy:", decision_tree_accuracy, "\n")
## Accuracy: 0.7894737
cat("Precision:", decision_tree_precision, "\n")
## Precision: 0.8033708
cat("Recall:", decision_tree_recall, "\n")
## Recall: 0.8719512
cat("F1 Score:", decision_tree_f1_score, "\n")
## F1 Score: 0.8362573
cat("\nSVM Model Metrics:\n")
##
## SVM Model Metrics:
cat("Accuracy:", svm_accuracy, "\n")
## Accuracy: 0.7631579
cat("Precision:", svm_precision, "\n")
## Precision: 0.7821229
cat("Recall:", svm_recall, "\n")
## Recall: 0.8536585
cat("F1 Score:", svm_f1_score, "\n")
## F1 Score: 0.8163265
# Predicted probabilities for decision tree model (column 2 = probability of class "1", survived)
decision_tree_probabilities <- predict(decision_tree_model, test_data, type = "prob")
decision_tree_probabilities <- decision_tree_probabilities[, 2]
# Predicted probabilities for SVM model; the column order of e1071's probability matrix
# may not match the factor level order, which is why pROC reports "controls > cases"
# for this model below
svm_probabilities <- attr(predict(svm_model, test_data, probability = TRUE), "probabilities")[, 2]
# Calculate ROC curve for decision tree model
decision_tree_roc <- roc(test_data$Survived, decision_tree_probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Calculate ROC curve for SVM model
svm_roc <- roc(test_data$Survived, svm_probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
# Plot ROC curves
plot(decision_tree_roc, col = "blue", main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate")
lines(svm_roc, col = "red")
legend("bottomright", legend = c("Decision Tree", "SVM"), col = c("blue", "red"), lty = 1)
# Calculate AUC for decision tree model
decision_tree_auc <- auc(decision_tree_roc)
# Calculate AUC for SVM model
svm_auc <- auc(svm_roc)
# Print AUC values
cat("AUC for Decision Tree Model:", decision_tree_auc, "\n")
## AUC for Decision Tree Model: 0.8523434
cat("AUC for SVM Model:", svm_auc, "\n")
## AUC for SVM Model: 0.8188965
Ahmad, A., Safi, O., Malebary, S., Alesawi, S., & Alkayal, E. (2021). Decision tree ensembles to predict Coronavirus Disease 2019 infection: A comparative study. Complexity, 2021, 1–8. https://doi.org/10.1155/2021/5550344
Guhathakurata, S., Kundu, S., Chakraborty, A., & Banerjee, J. S. (2021). A novel approach to predict COVID-19 using support vector machine. In Data Science for COVID-19 (pp. 351–364). https://doi.org/10.1016/B978-0-12-824536-1.00014-9
GeeksforGeeks. Comparing Support Vector Machines and Decision Trees for Text Classification. https://www.geeksforgeeks.org/comparing-support-vector-machines-and-decision-trees-for-text-classification/
Towards Data Science. A Complete View of Decision Trees and SVM in Machine Learning. https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b
GeeksforGeeks. Ensemble Learning with SVM and Decision Trees. https://www.geeksforgeeks.org/ensemble-learning-with-svm-and-decision-trees/