Introduction

Decision Tree vs. Support Vector Machines (SVM)

In my project, I compared the performance of Decision Trees and Support Vector Machines (SVM) for classifying survival in the Titanic dataset and discussed the impact of class imbalance on model performance.

A Decision Tree is a hierarchical model that splits data based on feature thresholds, making it easy to interpret and suitable for categorical data and nonlinear relationships. In my case, both Decision Trees achieved moderately high test accuracy, with the first tree (Pclass, Sex, Age, Fare) scoring 79.78% and the second tree (Pclass, Sex, Age, SibSp, Parch) scoring 78.65%. While Decision Trees are quick to train and test, they have some weaknesses: overfitting, especially when the tree grows too deep, and instability, where small changes in the data can lead to a different tree. Despite these weaknesses, Decision Trees can perform well on simpler datasets, and they provide clear decision rules and feature importance insights.
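To make the idea of a threshold split concrete, here is a minimal sketch of how the quality of one candidate split can be scored with Gini impurity. The toy vectors and the helper names `gini` and `split_gain` are assumptions made for illustration and are not part of the project code; rpart uses the Gini index by default for classification, but its internal implementation differs in detail.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Reduction in impurity obtained by splitting `labels` with a logical condition
split_gain <- function(labels, go_left) {
  n <- length(labels)
  left <- labels[go_left]
  right <- labels[!go_left]
  weighted <- (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
  gini(labels) - weighted
}

# Hypothetical toy data: survival (0/1) and sex for eight passengers
survived <- c(0, 0, 0, 0, 1, 1, 1, 0)
sex <- c("male", "male", "male", "male", "female", "female", "female", "female")

gain_sex <- split_gain(survived, sex == "female")
print(gain_sex)
```

A tree-growing algorithm would evaluate many such candidate splits and keep the one with the largest gain at each node.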

On the other hand, Support Vector Machines (SVM) are powerful models that identify a hyperplane to separate data points into classes. With different kernel options, SVM can handle both linearly and non-linearly separable data. I tested four kernels (linear, polynomial, radial, and sigmoid) and found that the Radial Kernel achieved the highest accuracy of the four, 78.73%, with an AUC of 0.8111. The Linear Kernel had a lower accuracy (76.12%) but the highest AUC (0.8387). While SVM is often more robust and less prone to overfitting than a single deep Decision Tree, it requires careful tuning of hyperparameters and has a high computational cost, especially for large datasets. The Sigmoid Kernel performed poorly (65.30% accuracy, AUC 0.6590), indicating that it was not suitable for this particular dataset.
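As a rough illustration of what the four kernel options compute, the sketch below writes each kernel as a plain R function, following the formulas given in the e1071 `svm()` documentation. The function names and the two example vectors are assumptions for illustration only; the actual fitting in this report is done inside e1071.

```r
# Kernel functions in the parameterization documented for e1071::svm()
linear_kernel  <- function(u, v) sum(u * v)
poly_kernel    <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
radial_kernel  <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
sigmoid_kernel <- function(u, v, gamma, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

# Two hypothetical observations with three features each
u <- c(1, 0, 2)
v <- c(0.5, 1, 1)
gamma <- 1 / length(u)  # svm() defaults to 1 / (number of features)

print(linear_kernel(u, v))
print(poly_kernel(u, v, gamma))
print(radial_kernel(u, v, gamma))
print(sigmoid_kernel(u, v, gamma))
```

The radial kernel depends only on the distance between the two points, which is what lets it draw curved, local decision boundaries.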

Regarding which algorithm is recommended for more accurate results, I agree that SVM with the Radial Kernel is the better choice, particularly for complex datasets. It achieved the highest accuracy among the SVM kernels, although the Linear Kernel had the higher AUC and the first Decision Tree scored about one percentage point higher in accuracy on its own test split. Its ability to capture non-linear relationships gives it an edge over Decision Trees, which can be more prone to overfitting, especially in noisy or intricate datasets.

In terms of whether these algorithms are better for classification or regression, both support either task. Decision Trees are most often introduced for classification but also handle regression (for example, rpart with method = "anova"). SVM is likewise most commonly used for classification and has a regression variant (SVR), though it can be more computationally expensive than simpler models.

I do agree with the recommendation to use SVM with the Radial Kernel, especially in situations where complex, non-linear relationships exist and generalization is key. However, if interpretability is crucial, Decision Trees might be a better choice, as they provide clear, understandable decision rules.

Lastly, in my project, I encountered a problem caused by class imbalance in the Survived category of the dataset. The Not Survived (0) class makes up about 61.6% of passengers and the Survived (1) class about 38.4%, so the models leaned toward predicting the majority class. For example, the Radial Kernel SVM correctly identified 84.85% of non-survivors but only 68.93% of survivors. A model that favors the majority class in this way can look accurate overall while generalizing poorly for the minority class. This highlights the importance of addressing class imbalance to obtain a more balanced and accurate model.
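A minimal sketch of one common remedy, inverse-frequency class weights, is shown below. The class counts are derived from figures reported later in this document (891 passengers, mean of Survived 0.3838); the variable names are assumptions. A named vector of this shape is what the `class.weights` argument of e1071's `svm()` accepts, although the models in this report were fit without it.

```r
n_total <- 891                         # passengers, from str(df)
n_survived <- round(0.3838 * n_total)  # mean of Survived, from summary(df)
counts <- c("0" = n_total - n_survived, "1" = n_survived)

# Inverse-frequency weights: n / (number of classes * class count)
class_weights <- n_total / (length(counts) * counts)
print(round(class_weights, 3))

# Accuracy of a model that always predicts the majority class
baseline_accuracy <- max(counts) / n_total
print(round(baseline_accuracy, 4))
```

The baseline is a useful reference point: any accuracy figure should be read against what always predicting Not Survived would achieve.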

Scholarly Articles

Article 1: https://towardsdatascience.com/a-complete-view-of-decision-trees-and-svm-in-machine-learning-f9f3d19a337b

Article 2: https://www.researchgate.net/publication/221579383_Support_Vector_Machines_for_Classification

Article 3: https://www.nature.com/articles/s41598-023-47174-w

Dataset

df <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Project%202/Titanic-Dataset.csv")
head(df)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

I will now do some data visualization to get a better understanding of the data and the relationships within it.

library(ggplot2)  # For the plots below

# 1. Survival Count
ggplot(df, aes(x = factor(Survived), fill = factor(Survived))) +
  geom_bar() +
  labs(title = "Survival Count", x = "Survived (0 = No, 1 = Yes)", y = "Count") +
  scale_fill_manual(values = c("red", "green"), name = "Survival Status") +
  theme_minimal()

# 2. Passenger Class Distribution by Survival
ggplot(df, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(title = "Passenger Class Distribution by Survival", x = "Passenger Class", y = "Proportion") +
  scale_fill_manual(values = c("red", "green"), name = "Survival Status") +
  theme_minimal()

# 3. Age Distribution by Survival
ggplot(df, aes(x = Age, fill = factor(Survived))) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6) +
  labs(title = "Age Distribution by Survival", x = "Age", y = "Count") +
  scale_fill_manual(values = c("red", "green"), name = "Survival Status") +
  theme_minimal()

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

# 4. Fare Distribution by Survival
ggplot(df, aes(x = factor(Survived), y = Fare, fill = factor(Survived))) +
  geom_boxplot() +
  labs(title = "Fare Distribution by Survival", x = "Survived (0 = No, 1 = Yes)", y = "Fare") +
  scale_fill_manual(values = c("red", "green"), name = "Survival Status") +
  theme_minimal()

# 5. Port of Embarkation by Survival
ggplot(df, aes(x = Embarked, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Port of Embarkation by Survival", x = "Port of Embarkation", y = "Count") +
  scale_fill_manual(values = c("red", "green"), name = "Survival Status") +
  theme_minimal()

Next, I remove the columns that I do not expect to be useful in the model.

df <- subset(df, select = -c(PassengerId, Name, Ticket, Pclass, Cabin))

Exploratory Data Analysis

summary(df)

##     Survived          Sex                 Age            SibSp      
##  Min.   :0.0000   Length:891         Min.   : 0.42   Min.   :0.000  
##  1st Qu.:0.0000   Class :character   1st Qu.:20.12   1st Qu.:0.000  
##  Median :0.0000   Mode  :character   Median :28.00   Median :0.000  
##  Mean   :0.3838                      Mean   :29.70   Mean   :0.523  
##  3rd Qu.:1.0000                      3rd Qu.:38.00   3rd Qu.:1.000  
##  Max.   :1.0000                      Max.   :80.00   Max.   :8.000  
##                                      NA's   :177                    
##      Parch             Fare          Embarked        
##  Min.   :0.0000   Min.   :  0.00   Length:891        
##  1st Qu.:0.0000   1st Qu.:  7.91   Class :character  
##  Median :0.0000   Median : 14.45   Mode  :character  
##  Mean   :0.3816   Mean   : 32.20                     
##  3rd Qu.:0.0000   3rd Qu.: 31.00                     
##  Max.   :6.0000   Max.   :512.33                     
##

str(df)

## 'data.frame':    891 obs. of  7 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: chr  "S" "C" "S" "S" ...

Next, I handle the missing data by imputing missing Age values with the median age.

missing_values <- colSums(is.na(df))
print(missing_values)

## Survived      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0      177        0        0        0        0

# Impute missing ages with the median age
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)

# Treat the outcome as a factor so that svm() performs classification
df$Survived <- factor(df$Survived)

# Convert categorical variables to numeric
df$Sex <- ifelse(df$Sex == "female", 1, 0) # 1 for female, 0 for male
# Integer codes for the levels of 'Embarked' (any blank entries form their own level)
df$Embarked <- as.numeric(factor(df$Embarked))

# Splitting the data into training and testing sets
library(caTools)  # For sample.split
set.seed(592)

split <- sample.split(df$Survived, SplitRatio = 0.7)
train_set <- subset(df, split == TRUE)
test_set  <- subset(df, split == FALSE)
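For readers without caTools, the sketch below shows one way a class-stratified 70/30 split could be written in base R. The helper `stratified_split` and the toy outcome vector are assumptions for illustration; this is not how `sample.split` is implemented, only a way to reach a similar result of keeping the class ratio roughly equal in both sets.

```r
# Return a logical vector marking training rows, sampled separately per class
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in as.character(unique(y))) {
    idx <- which(y == cls)
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

set.seed(592)
# Hypothetical outcome with a 60/40 class ratio
y <- factor(c(rep(0, 60), rep(1, 40)))
in_train <- stratified_split(y)

print(table(y[in_train]))   # training set class counts
print(table(y[!in_train]))  # test set class counts
```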

I will be fitting the model using the training set and then testing it on the testing set. I will also be looking at important features in the model.

library(e1071)  # For svm

# Fit a linear SVM model
svm_model <- svm(Survived ~ ., data = train_set, kernel = "linear", scale = TRUE)

# Recover the weight vector of the linear decision function:
# support-vector coefficients multiplied by the (scaled) support vectors
coef_matrix <- t(svm_model$coefs) %*% svm_model$SV
importance <- abs(coef_matrix)
importance <- importance / sum(importance)  # Normalize importance scores

# Print importance
importance

##           Sex          Age        SibSp        Parch         Fare     Embarked
## [1,] 0.999183 0.0001077892 0.0004249355 5.747648e-05 0.0001020406 0.0001247935
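To show why that matrix product yields one weight per feature, here is a small worked sketch with made-up numbers. The support vectors and coefficients below are hypothetical and chosen only so the arithmetic is easy to follow. Because the model above was fit with scale = TRUE, its weights refer to the scaled features.

```r
# Hypothetical support vectors (one per row) for two features
SV <- matrix(c( 1.0,  0.5,
               -1.0,  0.1,
                0.5, -0.3),
             ncol = 2, byrow = TRUE,
             dimnames = list(NULL, c("Sex", "Age")))

# Hypothetical coefficients, one per support vector
coefs <- matrix(c(0.8, -1.0, 0.2), ncol = 1)

# Weight vector: each feature's weight is the coefficient-weighted sum of its column
w <- t(coefs) %*% SV
importance <- abs(w) / sum(abs(w))

print(w)
print(round(importance, 4))
```

A feature whose weight dominates, as Sex does in the output above, is driving almost all of the linear decision boundary.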

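library(corrplot)  # For the correlation plot

# Subset numeric columns (Survived is a factor at this point, so it is excluded)
numeric_df <- df[sapply(df, is.numeric)]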

# Compute the correlation matrix
cor_matrix <- cor(numeric_df, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "full", 
         col = colorRampPalette(c("red", "white", "blue"))(200), 
         tl.col = "black", tl.cex = 0.8, 
         addCoef.col = "black", # Add correlation coefficients
         number.cex = 0.7,      # Text size for coefficients
         main = "Customized Correlation Matrix")

# Model Evaluation

library(e1071)   # For SVM
library(caret)   # For confusion matrix and additional metrics



# Function to evaluate model accuracy and confusion matrix
evaluate_model <- function(model, test_data, true_labels) {
  # Make predictions
  predictions <- predict(model, test_data)
  
  # Calculate accuracy
  accuracy <- mean(predictions == true_labels)
  
  # Confusion matrix
  cm <- confusionMatrix(predictions, true_labels)
  
  # Print metrics
  print(cm)
  cat("Accuracy:", accuracy, "\n")
  
  # Return results as a list
  return(list(accuracy = accuracy, confusion_matrix = cm))
}

Linear Kernel SVM Model Summary:

# SVM model with linear kernel
svm_model_linear <- svm(Survived ~ ., 
                        data = train_set, 
                        type = "C-classification", 
                        cost = 100, 
                        kernel = "linear", 
                        scale = FALSE)

cat("\nLinear Kernel SVM Model Summary:\n")

## 
## Linear Kernel SVM Model Summary:

print(svm_model_linear)

## 
## Call:
## svm(formula = Survived ~ ., data = train_set, type = "C-classification", 
##     cost = 100, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  100 
## 
## Number of Support Vectors:  270

# Evaluate model performance
evaluation_results <- evaluate_model(svm_model_linear, test_set, test_set$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 132  31
##          1  33  72
##                                          
##                Accuracy : 0.7612         
##                  95% CI : (0.7055, 0.811)
##     No Information Rate : 0.6157         
##     P-Value [Acc > NIR] : 3.004e-07      
##                                          
##                   Kappa : 0.4972         
##                                          
##  Mcnemar's Test P-Value : 0.9005         
##                                          
##             Sensitivity : 0.8000         
##             Specificity : 0.6990         
##          Pos Pred Value : 0.8098         
##          Neg Pred Value : 0.6857         
##              Prevalence : 0.6157         
##          Detection Rate : 0.4925         
##    Detection Prevalence : 0.6082         
##       Balanced Accuracy : 0.7495         
##                                          
##        'Positive' Class : 0              
##                                          
## Accuracy: 0.761194

cat("Linear Kernel Accuracy:", evaluation_results$accuracy, "\n")

## Linear Kernel Accuracy: 0.761194
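The statistics printed by confusionMatrix can be reproduced by hand, which helps when reading them. The sketch below rebuilds the linear-kernel confusion matrix from the counts printed above and recomputes the main metrics; the variable names are assumptions. Note that caret reports class 0 (Not Survived) as the positive class here, so Sensitivity is the recall for non-survivors and Specificity is the recall for survivors.

```r
# Counts from the linear-kernel output (rows = prediction, columns = reference)
cm <- matrix(c(132, 33, 31, 72), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # recall for class 0 (Not Survived)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # recall for class 1 (Survived)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

The gap between the two recall values is the practical sign of the majority-class bias discussed in the introduction.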

Polynomial Kernel SVM Model Summary:

# Polynomial Kernel SVM Model
svm_model_poly <- svm(Survived ~ ., 
                      data = train_set, 
                      kernel = "polynomial", 
                      cost = 100, 
                      scale = TRUE)

cat("\nPolynomial Kernel SVM Model Summary:\n")

## 
## Polynomial Kernel SVM Model Summary:

print(svm_model_poly)

## 
## Call:
## svm(formula = Survived ~ ., data = train_set, kernel = "polynomial", 
##     cost = 100, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  100 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  244

# Evaluate model performance for Polynomial Kernel
evaluation_results_poly <- evaluate_model(svm_model_poly, test_set, test_set$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 132  30
##          1  33  73
##                                           
##                Accuracy : 0.7649          
##                  95% CI : (0.7095, 0.8144)
##     No Information Rate : 0.6157          
##     P-Value [Acc > NIR] : 1.474e-07       
##                                           
##                   Kappa : 0.506           
##                                           
##  Mcnemar's Test P-Value : 0.8011          
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.7087          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.6887          
##              Prevalence : 0.6157          
##          Detection Rate : 0.4925          
##    Detection Prevalence : 0.6045          
##       Balanced Accuracy : 0.7544          
##                                           
##        'Positive' Class : 0               
##                                           
## Accuracy: 0.7649254

cat("Polynomial Kernel Accuracy:", evaluation_results_poly$accuracy, "\n")

## Polynomial Kernel Accuracy: 0.7649254

Radial Kernel SVM Model Summary:

# Train Radial Kernel SVM Model
svm_model_radial <- svm(Survived ~ ., 
                        data = train_set, 
                        kernel = "radial", 
                        scale = TRUE)

# Model summary
summary(svm_model_radial)

## 
## Call:
## svm(formula = Survived ~ ., data = train_set, kernel = "radial", 
##     scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  283
## 
##  ( 143 140 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

# Evaluate the radial kernel model
evaluation_results_radial <- evaluate_model(svm_model_radial, test_set, test_set$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 140  32
##          1  25  71
##                                           
##                Accuracy : 0.7873          
##                  95% CI : (0.7334, 0.8347)
##     No Information Rate : 0.6157          
##     P-Value [Acc > NIR] : 1.337e-09       
##                                           
##                   Kappa : 0.5448          
##                                           
##  Mcnemar's Test P-Value : 0.4268          
##                                           
##             Sensitivity : 0.8485          
##             Specificity : 0.6893          
##          Pos Pred Value : 0.8140          
##          Neg Pred Value : 0.7396          
##              Prevalence : 0.6157          
##          Detection Rate : 0.5224          
##    Detection Prevalence : 0.6418          
##       Balanced Accuracy : 0.7689          
##                                           
##        'Positive' Class : 0               
##                                           
## Accuracy: 0.7873134

# Accessing the results
cat("Radial Kernel Accuracy:", evaluation_results_radial$accuracy, "\n")

## Radial Kernel Accuracy: 0.7873134

Sigmoid Kernel SVM Model Summary:

# Sigmoid Kernel SVM Model
svm_model_sigmoid <- svm(Survived ~ ., 
                         data = train_set, 
                         kernel = "sigmoid", 
                         cost = 1, 
                         scale = TRUE)

cat("\nSigmoid Kernel SVM Model Summary:\n")

## 
## Sigmoid Kernel SVM Model Summary:

summary(svm_model_sigmoid)

## 
## Call:
## svm(formula = Survived ~ ., data = train_set, kernel = "sigmoid", 
##     cost = 1, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  1 
##      coef.0:  0 
## 
## Number of Support Vectors:  242
## 
##  ( 120 122 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

# Evaluate model performance for Sigmoid Kernel
evaluation_results_sigmoid <- evaluate_model(svm_model_sigmoid, test_set, test_set$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 111  39
##          1  54  64
##                                           
##                Accuracy : 0.653           
##                  95% CI : (0.5927, 0.7099)
##     No Information Rate : 0.6157          
##     P-Value [Acc > NIR] : 0.1160          
##                                           
##                   Kappa : 0.2863          
##                                           
##  Mcnemar's Test P-Value : 0.1466          
##                                           
##             Sensitivity : 0.6727          
##             Specificity : 0.6214          
##          Pos Pred Value : 0.7400          
##          Neg Pred Value : 0.5424          
##              Prevalence : 0.6157          
##          Detection Rate : 0.4142          
##    Detection Prevalence : 0.5597          
##       Balanced Accuracy : 0.6470          
##                                           
##        'Positive' Class : 0               
##                                           
## Accuracy: 0.6529851

cat("Sigmoid Kernel Accuracy:", evaluation_results_sigmoid$accuracy, "\n")

## Sigmoid Kernel Accuracy: 0.6529851

AUC for SVM Models

library(pROC)  # For roc() and auc()

# AUC for Linear Kernel
decision_values_linear <- as.numeric(attr(predict(svm_model_linear, test_set, decision.values = TRUE), "decision.values"))
roc_linear <- roc(test_set$Survived, decision_values_linear, levels = c(0, 1))

## Setting direction: controls < cases

auc_linear <- auc(roc_linear)

# AUC for Polynomial Kernel
decision_values_poly <- as.numeric(attr(predict(svm_model_poly, test_set, decision.values = TRUE), "decision.values"))
roc_poly <- roc(test_set$Survived, decision_values_poly, levels = c(0, 1))

## Setting direction: controls < cases

auc_poly <- auc(roc_poly)

# AUC for Radial Kernel
decision_values_radial <- as.numeric(attr(predict(svm_model_radial, test_set, decision.values = TRUE), "decision.values"))
roc_radial <- roc(test_set$Survived, decision_values_radial, levels = c(0, 1))

## Setting direction: controls < cases

auc_radial <- auc(roc_radial)

# AUC for Sigmoid Kernel
decision_values_sigmoid <- as.numeric(attr(predict(svm_model_sigmoid, test_set, decision.values = TRUE), "decision.values"))
roc_sigmoid <- roc(test_set$Survived, decision_values_sigmoid, levels = c(0, 1))

## Setting direction: controls < cases

auc_sigmoid <- auc(roc_sigmoid)

# Print AUC values
cat("Linear Kernel AUC:", auc_linear, "\n")

## Linear Kernel AUC: 0.8387467

cat("Polynomial Kernel AUC:", auc_poly, "\n")

## Polynomial Kernel AUC: 0.7734334

cat("Radial Kernel AUC:", auc_radial, "\n")

## Radial Kernel AUC: 0.8110915

cat("Sigmoid Kernel AUC:", auc_sigmoid, "\n")

## Sigmoid Kernel AUC: 0.6589879

# Plot ROC Curves
par(mfrow = c(2, 2))  # Set up plotting area

plot(roc_linear, main = "ROC Curve - Linear Kernel", col = "blue", lwd = 2)
plot(roc_poly, main = "ROC Curve - Polynomial Kernel", col = "red", lwd = 2)
plot(roc_radial, main = "ROC Curve - Radial Kernel", col = "green", lwd = 2)
plot(roc_sigmoid, main = "ROC Curve - Sigmoid Kernel", col = "purple", lwd = 2)

par(mfrow = c(1, 1))  # Reset plotting area
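AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks on a toy example; the function `auc_by_rank` and the scores are assumptions for illustration, and the function assumes higher scores point to class 1, whereas pROC chooses the direction automatically.

```r
# Rank-based (Mann-Whitney) AUC for binary labels coded 0/1
auc_by_rank <- function(scores, labels) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks are used for ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical decision values and true classes
scores <- c(-1.2, -0.4, 0.3, 0.9, -0.1, 1.5)
labels <- c(0, 0, 0, 1, 1, 1)

auc_toy <- auc_by_rank(scores, labels)
print(auc_toy)
```

In this toy case, 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.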

Decision Tree Model

Decision Tree Model 1

# Reload the raw data so that Pclass is available again
df <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Project%202/Titanic-Dataset.csv")
missing_values <- colSums(is.na(df))
print(missing_values)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

# Splitting the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(df$Survived, p = .7, list = FALSE)
train_set <- df[trainIndex, ]
test_set  <- df[-trainIndex, ]

library(rpart)       # For decision trees
library(rpart.plot)  # For plotting the fitted trees

# Decision Tree Model 1: Pclass, Sex, Age, and Fare as predictors
tree_model1 <- rpart(Survived ~ Pclass + Sex + Age + Fare, 
                      data = train_set, method = "class")
rpart.plot(tree_model1)

# Prediction and evaluation for Decision Tree 1
predictions1 <- predict(tree_model1, test_set, type = "class")

# Confusion matrix for Decision Tree 1 (rows = actual, columns = predicted)
confusion_matrix1 <- table(test_set$Survived, predictions1)
accuracy1 <- sum(diag(confusion_matrix1)) / sum(confusion_matrix1)

# Print the confusion matrix for the first decision tree
confusion_matrix1

##    predictions1
##       0   1
##   0 146  21
##   1  33  67

print(paste("Accuracy of first decision tree:", round(accuracy1 * 100, 2), "%"))

## [1] "Accuracy of first decision tree: 79.78 %"

# Decision Tree Model 2: Pclass, Sex, Age, SibSp, and Parch as predictors
tree_model2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch, 
                      data = train_set, method = "class")

rpart.plot(tree_model2)

# Prediction using test set
predictions2 <- predict(tree_model2, test_set, type = "class")

# Confusion matrix for Decision Tree 2 (rows = actual, columns = predicted)
confusion_matrix2 <- table(test_set$Survived, predictions2)
accuracy2 <- sum(diag(confusion_matrix2)) / sum(confusion_matrix2)

confusion_matrix2

##    predictions2
##       0   1
##   0 140  27
##   1  30  70

print(paste("Accuracy of second decision tree:", round(accuracy2 * 100, 2), "%"))

## [1] "Accuracy of second decision tree: 78.65 %"

Conclusion

In this project, I compared the performance of Decision Trees and Support Vector Machines (SVM) on the Titanic dataset to determine which algorithm would provide more accurate results for binary classification. Both models showed good performance, and the results were close: SVM with the Radial Kernel reached 78.73% accuracy (AUC 0.8111), the best accuracy among the four kernels, while the two Decision Trees reached 79.78% and 78.65%. The Linear Kernel had the highest AUC (0.8387). Because the SVMs and the trees were evaluated on different train/test splits and different feature sets, these figures are not strictly comparable. I still favor SVM with the Radial Kernel for its ability to handle complex, non-linear relationships, which matters most for datasets with intricate boundaries between classes.

While Decision Trees are useful for simpler datasets and offer easy interpretability, they are more prone to overfitting and instability, especially with deeper trees. SVM, on the other hand, is often more robust and less prone to overfitting when its hyperparameters are tuned well, making it a better option for more complex data.

Additionally, I encountered a problem caused by class imbalance in the Survived category. The models’ bias toward the majority class (Not Survived) resulted in weaker performance on the minority class (Survived). Addressing class imbalance through techniques like resampling or adjusting class weights is important to reduce this bias and obtain better model performance for both classes.
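As a minimal sketch of the resampling idea, the code below randomly oversamples the minority class until both classes are the same size. The helper `oversample_minority` and the toy data frame are assumptions for illustration and were not used in this project; in practice such resampling would be applied to the training set only, never to the test set.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical training frame with an imbalanced 0/1 outcome
toy <- data.frame(Survived = factor(c(rep(0, 12), rep(1, 5))),
                  Fare = round(runif(17, 5, 80), 2))

oversample_minority <- function(data, outcome) {
  counts <- table(data[[outcome]])
  minority <- names(counts)[which.min(counts)]
  deficit <- max(counts) - min(counts)
  minority_rows <- which(data[[outcome]] == minority)
  # Draw extra minority rows with replacement until the classes are equal in size
  extra <- minority_rows[sample.int(length(minority_rows), deficit, replace = TRUE)]
  data[c(seq_len(nrow(data)), extra), , drop = FALSE]
}

balanced <- oversample_minority(toy, "Survived")
print(table(balanced$Survived))
```

Oversampling duplicates existing rows, so it can encourage overfitting to those rows; class weights are an alternative that leaves the data unchanged.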

Overall, while SVM with the Radial Kernel is recommended for this dataset, the choice of algorithm should always consider the specific dataset characteristics and the importance of model interpretability versus accuracy.

Data 622 Project 3 Final

Mikhail Broomes

2024-11-24