Analyzing the Titanic Dataset: Decision Tree and Random Forest Models

The dataset chosen for analysis comes from a Kaggle competition aimed at predicting which passengers survived the Titanic shipwreck. The data are divided into a training set and a testing set. Initially, a thorough examination of the variables’ summaries provided insight into the nature of the data. This step involved identifying missing values, determining whether variables were continuous or categorical, and devising strategies to handle the missing data.

One such strategy involved imputing missing values for the “Embarked” variable by replacing them with the highest-frequency category, “S”. Similarly, the single missing fare value in the “titanic_test” dataset was replaced with the mean fare, since fare is a continuous variable. Addressing missing age data required a more nuanced approach. Leveraging feature engineering, the passengers’ names were used to infer their ages: the titles embedded in the names (“Miss”, “Mrs”, etc.) were extracted, average ages per title were computed, and those averages were used to impute the missing age values.

To streamline the analysis, certain variables were deemed irrelevant and dropped from consideration. The “Cabin” variable was discarded due to its high proportion of missing values, while “Ticket” and “PassengerId” were excluded because they serve only as identifiers. Categorical variables were then encoded numerically (Sex as a 0/1 indicator, Embarked as integer codes) to facilitate construction of the decision tree and random forest models.

Following data preprocessing, a correlation matrix was computed to check for collinearity among the variables, which can adversely impact the performance of predictive models. Upon confirming the absence of problematic collinearity, attention shifted to identifying the most influential variables. Sex and Pclass emerged as pivotal predictors of survival, showing the strongest correlations with the outcome variable.

To assess model performance, two decision tree models were built: one restricted to Pclass and Age (deliberately excluding Sex, the most important variable) and another incorporating all variables. This comparative setup allowed an informal evaluation of bias and variance between the two models. The accuracy gap of 0.09022556 indicated a meaningful disparity in performance, with the full model consistently outperforming the restricted one on the held-out data. The reported “variance” of 0.09022556 is identical by construction: with only two models, the sum of absolute deviations from their mean accuracy equals the accuracy gap itself.

Subsequently, a random forest model was constructed, yielding the highest accuracy of all models, with an accuracy score of 0.8947. Note that the forest was fit on all of titanic_train, which includes the rows later used as test_data, so this estimate is likely optimistic; the out-of-bag error of 16.95% is the more conservative figure. Random forests are known for their robustness and ability to handle complex datasets, making them well suited to predictive tasks such as survival prediction on the Titanic dataset.

The analysis of the Titanic dataset using decision tree and random forest models not only highlights their efficacy but also underscores practical concerns that come with their use. For complex prediction problems such as passenger survival, decision trees can grow large and unwieldy, posing challenges for maintenance and version control. It is therefore worth leveraging tools that place decision trees under version control, facilitating collaboration and ensuring the orderly evolution of models over time.

In conclusion, the analysis of the Titanic dataset showcased the effectiveness of decision tree and random forest models in predicting passenger survival. By leveraging feature engineering, preprocessing techniques, and model evaluation strategies, valuable insights were gleaned regarding the influential factors determining survival outcomes. These findings underscore the importance of robust data analysis methodologies in deriving actionable insights from complex datasets.

https://www.kaggle.com/competitions/titanic/data?select=test.csv
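
The analysis below assumes the following packages are attached and that the competition CSVs have been read in. This setup block is a sketch rather than part of the original output; the file paths in particular are assumptions.

# Assumed setup (sketch): packages used throughout this analysis
library(dplyr)        # pipelines, mutate_if(), case_when()
library(tidyr)        # replace_na()
library(stringr)      # str_extract()
library(ggplot2)      # plots
library(corrplot)     # correlation heatmap
library(caret)        # createDataPartition(), confusionMatrix()
library(rpart)        # decision trees
library(rpart.plot)   # tree diagrams
library(randomForest) # random forest

# Assumed data load (hypothetical paths)
titanic_train <- read.csv("train.csv", stringsAsFactors = FALSE)
titanic_test  <- read.csv("test.csv", stringsAsFactors = FALSE)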

# Convert integer columns to factors (PassengerId, Survived, Pclass, SibSp,
# Parch become categorical) and turn empty strings into NA in both sets
titanic_train <- titanic_train %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~factor(replace(., . == "", NA)))

titanic_test <- titanic_test %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~factor(replace(., . == "", NA)))
summary(titanic_train)
##   PassengerId  Survived Pclass                                     Name    
##  1      :  1   0:549    1:216   Abbing, Mr. Anthony                  :  1  
##  2      :  1   1:342    2:184   Abbott, Mr. Rossmore Edward          :  1  
##  3      :  1            3:491   Abbott, Mrs. Stanton (Rosa Hunt)     :  1  
##  4      :  1                    Abelson, Mr. Samuel                  :  1  
##  5      :  1                    Abelson, Mrs. Samuel (Hannah Wizosky):  1  
##  6      :  1                    Adahl, Mr. Mauritz Nils Martin       :  1  
##  (Other):885                    (Other)                              :885  
##      Sex           Age        SibSp   Parch        Ticket         Fare       
##  female:314   Min.   : 0.42   0:608   0:678   1601    :  7   Min.   :  0.00  
##  male  :577   1st Qu.:20.12   1:209   1:118   347082  :  7   1st Qu.:  7.91  
##               Median :28.00   2: 28   2: 80   CA. 2343:  7   Median : 14.45  
##               Mean   :29.70   3: 16   3:  5   3101295 :  6   Mean   : 32.20  
##               3rd Qu.:38.00   4: 18   4:  4   347088  :  6   3rd Qu.: 31.00  
##               Max.   :80.00   5:  5   5:  5   CA 2144 :  6   Max.   :512.33  
##               NA's   :177     8:  7   6:  1   (Other) :852                   
##          Cabin     Embarked  
##  B96 B98    :  4   C   :168  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3   NA's:  2  
##  D          :  3             
##  (Other)    :186             
##  NA's       :687
summary(titanic_test)
##   PassengerId  Pclass                                         Name    
##  892    :  1   1:107   Abbott, Master. Eugene Joseph            :  1  
##  893    :  1   2: 93   Abelseth, Miss. Karen Marie              :  1  
##  894    :  1   3:218   Abelseth, Mr. Olaus Jorgensen            :  1  
##  895    :  1           Abrahamsson, Mr. Abraham August Johannes :  1  
##  896    :  1           Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1  
##  897    :  1           Aks, Master. Philip Frank                :  1  
##  (Other):412           (Other)                                  :412  
##      Sex           Age        SibSp       Parch          Ticket   
##  female:152   Min.   : 0.17   0:283   0      :324   PC 17608:  5  
##  male  :266   1st Qu.:21.00   1:110   1      : 52   113503  :  4  
##               Median :27.00   2: 14   2      : 33   CA. 2343:  4  
##               Mean   :30.27   3:  4   3      :  3   16966   :  3  
##               3rd Qu.:39.00   4:  4   4      :  2   220845  :  3  
##               Max.   :76.00   5:  1   9      :  2   347077  :  3  
##               NA's   :86      8:  2   (Other):  2   (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000   B57 B59 B63 B66:  3   C:102   
##  1st Qu.:  7.896   A34            :  2   Q: 46   
##  Median : 14.454   B45            :  2   S:270   
##  Mean   : 35.627   C101           :  2           
##  3rd Qu.: 31.500   C116           :  2           
##  Max.   :512.329   (Other)        : 80           
##  NA's   :1         NA's           :327

Imputing missing values: Embarked:

ggplot(titanic_train, aes(x = factor(Embarked))) +
  geom_bar() +
  labs(title = "Embarked Count", x = "Embarked", y = "Count")   



Since most passengers embarked from S and there are only 2 NA Embarked values, those NAs will be replaced with S.

titanic_train$Embarked <- replace_na(titanic_train$Embarked, "S")

Because only one value is missing from the test data, in the Fare column, it is imputed with the mean fare amount:

titanic_test <- titanic_test %>%
  mutate(Fare = if_else(is.na(Fare), mean(Fare, na.rm = TRUE), Fare))
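
As a quick sanity check (a minimal sketch, not in the original output), both imputations can be verified before moving on:

# Verify the imputations: no NA should remain in either column
table(titanic_train$Embarked, useNA = "ifany")
sum(is.na(titanic_test$Fare))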

Decided to drop the Cabin column because it has too many missing values, and the main thing it could help determine, financial class, is already captured by Pclass.

How to determine Age, though? It could be done from the name; the title embedded in each name turns out to be a better signal.

# Extract the title ("Mr.", "Miss.", etc.) from the 'Name' column
# into a new 'Initial' column
titanic_train$Initial <- str_extract(titanic_train$Name, "[A-Za-z]+\\.")

# Replace any titles the regex failed to match with a placeholder
titanic_train$Initial[is.na(titanic_train$Initial)] <- "Unknown"
ggplot(titanic_train, aes(x = Initial, fill = factor(Sex))) +
  geom_bar(position = "dodge") +
  labs(title = "Title Counts by Sex", x = "Initial", y = "Count") +
  scale_fill_manual(values = c("male" = "skyblue", "female" = "salmon"))

# Calculate age range for each initial
initial_age_range <- titanic_train %>%
  group_by(Initial) %>%
  summarize(Count = n(),
            Min_Age = min(Age, na.rm = TRUE),
            Max_Age = max(Age, na.rm = TRUE),
            Average = mean(Age, na.rm = TRUE))

# Print the result
print(initial_age_range)
## # A tibble: 17 × 5
##    Initial   Count Min_Age Max_Age Average
##    <chr>     <int>   <dbl>   <dbl>   <dbl>
##  1 Capt.         1   70         70   70   
##  2 Col.          2   56         60   58   
##  3 Countess.     1   33         33   33   
##  4 Don.          1   40         40   40   
##  5 Dr.           7   23         54   42   
##  6 Jonkheer.     1   38         38   38   
##  7 Lady.         1   48         48   48   
##  8 Major.        2   45         52   48.5 
##  9 Master.      40    0.42      12    4.57
## 10 Miss.       182    0.75      63   21.8 
## 11 Mlle.         2   24         24   24   
## 12 Mme.          1   24         24   24   
## 13 Mr.         517   11         80   32.4 
## 14 Mrs.        125   14         63   35.9 
## 15 Ms.           1   28         28   28   
## 16 Rev.          6   27         57   43.2 
## 17 Sir.          1   49         49   49
# Collapse rare titles into five groups (Master., Miss., Mr., Mrs., Other)
titanic_train <- titanic_train %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Mlle.", "Mme.","Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Countess.", "Mrs.", "Lady.") ~ "Mrs.",
    Initial %in% c("Don.", "Sir.", "Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))

titanic_train %>% group_by(Initial) %>%
  summarize(Count = n(),
            Average = mean(Age, na.rm = TRUE))
## # A tibble: 5 × 3
##   Initial Count Average
##   <chr>   <int>   <dbl>
## 1 Master.    40    4.57
## 2 Miss.     186   21.9 
## 3 Mr.       519   32.4 
## 4 Mrs.      127   36.0 
## 5 Other      19   46.2
titanic_train <- titanic_train %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
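
The repeated case_when branches above could be collapsed into a single grouped mutate. A sketch of the equivalent, more compact form (same result, assuming the dplyr pipeline already in use):

# Equivalent group-wise imputation (sketch): fill NA ages with the
# per-title mean in one pass
titanic_train <- titanic_train %>%
  group_by(Initial) %>%
  mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age)) %>%
  ungroup()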

Do the same thing with the test set:

# Extract the title from the 'Name' column into a new 'Initial' column
titanic_test$Initial <- str_extract(titanic_test$Name, "[A-Za-z]+\\.")

# Replace any titles the regex failed to match with a placeholder
titanic_test$Initial[is.na(titanic_test$Initial)] <- "Unknown"
ggplot(titanic_test, aes(x = Initial, fill = factor(Sex))) +
  geom_bar(position = "dodge") +
  labs(title = "Title Counts by Sex", x = "Initial", y = "Count") +
  scale_fill_manual(values = c("male" = "skyblue", "female" = "salmon"))

# Collapse rare titles into the same five groups for the test set
titanic_test <- titanic_test %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Dona.", "Mrs.") ~ "Mrs.",
    Initial %in% c("Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))
titanic_test <- titanic_test %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
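
With both sets imputed, a quick missing-value audit (a minimal sketch) shows what remains; only the soon-to-be-dropped Cabin column should still contain NAs:

# Count remaining NA values per column in each set
colSums(is.na(titanic_train))
colSums(is.na(titanic_test))
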
# Drop identifier/high-missingness columns and encode categoricals numerically
titanic_train <- titanic_train %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  )) %>%
  mutate_if(is.numeric, as.numeric)

# Apply the same drops and encodings to the test set
titanic_test <- titanic_test %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  )) %>%
  mutate_if(is.numeric, as.numeric)
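
Embarked is nominal rather than ordinal, so the 0/1/2 coding imposes an artificial ordering. A hedged alternative sketch (not applied here) would one-hot encode it instead; for threshold-splitting tree models the integer codes are usually harmless, but dummies matter for linear models:

# Sketch: one-hot encode Embarked rather than integer-coding it
embarked_dummies <- model.matrix(~ factor(Embarked) - 1, data = titanic_train)
head(embarked_dummies)
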
# Coerce every column (factors included) to numeric codes for the correlation matrix
titanic_train_numeric <- as.data.frame(sapply(titanic_train, as.numeric))

cor_matrix <- cor(titanic_train_numeric)

corrplot(cor_matrix, 
         method="color",
         addCoef.col = "black", 
         type="upper")

Examining the important variables to decide how to build the decision tree:

# Distribution of the outcome variable
ggplot(titanic_train, aes(x = factor(Survived))) +
  geom_bar(aes(fill = factor(Survived)), position = "dodge") +
  labs(title = "Survived Counts", x = "Survived (0 = No, 1 = Yes)", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))  # colors for 0 and 1

ggplot(titanic_train, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Sex (1 = Survived, 0 = Died)", x = "Sex", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))

ggplot(titanic_train, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Pclass (1 = Survived, 0 = Died)", x = "Pclass", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))

ggplot(titanic_train, aes(x = SibSp, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by SibSp (1 = Survived, 0 = Died)", x = "SibSp", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))

ggplot(titanic_train, aes(x = Parch, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Parch (1 = Survived, 0 = Died)", x = "Parch", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))

ggplot(titanic_train, aes(x = factor(Embarked), fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Survival by Embarked (1 = Survived, 0 = Died)", x = "Embarked", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))

ggplot(titanic_train, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age",
       x = "Age",
       y = "Frequency")

ggplot(titanic_train, aes(x = Fare)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fare",
       x = "Fare",
       y = "Frequency")

Models:

# Split the titanic_train dataset into training and testing sets
set.seed(123) # for reproducibility
train_index <- createDataPartition(titanic_train$Survived, p = 0.7, list = FALSE)
train_data <- titanic_train[train_index, ]
test_data <- titanic_train[-train_index, ]
# Tree 1: deliberately restricted to Pclass and Age (omits Sex, the strongest predictor)
tree_model_1 <- rpart(Survived ~ Pclass + Age, data = train_data, method = "class")
rpart.plot(tree_model_1)

importances <- tree_model_1$variable.importance
importances <- data.frame(Feature = names(importances), Importance = importances)
# Keep at most the top 6 features (this model has only 2, so head() avoids NA rows)
importances <- head(importances[order(importances$Importance, decreasing = TRUE), ], 6)

ggplot(importances, aes(x = Importance, y = Feature)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Feature Importances", x = "Importance", y = "Feature") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14))

test_predictions1 <- predict(tree_model_1, newdata = test_data, type = "class")

test_accuracy1 <- mean(test_predictions1 == test_data$Survived)

cat("Test Accuracy:", test_accuracy1, "\n")
## Test Accuracy: 0.6917293
tree_model_2 <- rpart(Survived ~ ., data = train_data, method = "class")
rpart.plot(tree_model_2)

importances <- tree_model_2$variable.importance
importances <- data.frame(Feature = names(importances), Importance = importances)
importances <- head(importances[order(importances$Importance, decreasing = TRUE), ], 6)

ggplot(importances, aes(x = Importance, y = Feature)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Feature Importances", x = "Importance", y = "Feature") 

test_predictions2 <- predict(tree_model_2, newdata = test_data, type = "class")


test_accuracy2 <- mean(test_predictions2 == test_data$Survived)

cat("Test Accuracy:", test_accuracy2, "\n")
## Test Accuracy: 0.7819549

# Calculate the absolute difference in accuracy between the two models
bias_difference <- abs(test_accuracy1 - test_accuracy2)

cat("Bias difference between the two models:", bias_difference, "\n")
## Bias difference between the two models: 0.09022556
# Dispersion of the two accuracies around their mean; with only two models this
# sum of absolute deviations equals the accuracy gap itself
mean_accuracy <- (test_accuracy1 + test_accuracy2) / 2
variance <- abs(test_accuracy1 - mean_accuracy) + abs(test_accuracy2 - mean_accuracy)

cat("Variance of the two models:", variance, "\n")
## Variance of the two models: 0.09022556
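
A more conventional way to gauge variance is to look at how a single model’s accuracy fluctuates across resampled splits, rather than across two different models. A sketch under that framing (the seed and the number of splits are arbitrary choices, not from the original analysis):

# Sketch: accuracy spread of the full tree across 20 repeated 70/30 splits
set.seed(123)
accs <- replicate(20, {
  idx <- createDataPartition(titanic_train$Survived, p = 0.7, list = FALSE)
  fit <- rpart(Survived ~ ., data = titanic_train[idx, ], method = "class")
  preds <- predict(fit, newdata = titanic_train[-idx, ], type = "class")
  mean(preds == titanic_train[-idx, ]$Survived)
})
cat("Mean accuracy:", mean(accs), "Variance:", var(accs), "\n")
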
set.seed(1234)
# Note: fitting on all of titanic_train means the rows in test_data were seen
# during training, so the test accuracy below is optimistic; the OOB error
# printed here is the more reliable generalization estimate
rf_model <- randomForest(Survived ~ ., data = titanic_train)
rf_model
## 
## Call:
##  randomForest(formula = Survived ~ ., data = titanic_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.95%
## Confusion matrix:
##     0   1 class.error
## 0 503  46  0.08378871
## 1 105 237  0.30701754
test_predictionsrf <- predict(rf_model, newdata = test_data)

#Confusion matrix
(conf_matrix <- confusionMatrix(test_predictionsrf , test_data$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 160  24
##          1   4  78
##                                           
##                Accuracy : 0.8947          
##                  95% CI : (0.8515, 0.9289)
##     No Information Rate : 0.6165          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7688          
##                                           
##  Mcnemar's Test P-Value : 0.0003298       
##                                           
##             Sensitivity : 0.9756          
##             Specificity : 0.7647          
##          Pos Pred Value : 0.8696          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.6165          
##          Detection Rate : 0.6015          
##    Detection Prevalence : 0.6917          
##       Balanced Accuracy : 0.8702          
##                                           
##        'Positive' Class : 0               
##
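
Finally, for the Kaggle competition itself, predictions on the held-out titanic_test would be written to a submission file. The sketch below is hypothetical: it assumes PassengerId was saved to a vector (e.g. test_ids) before the column was dropped, and it collapses factor levels unseen in training, since the test set’s Parch contains a 9 that never occurs in the training data and would make predict() error:

# Hypothetical submission step (sketch; test_ids assumed saved earlier)
# Collapse Parch levels unseen in training to "0", the most common level
titanic_test$Parch <- factor(
  ifelse(titanic_test$Parch %in% levels(titanic_train$Parch),
         as.character(titanic_test$Parch), "0"),
  levels = levels(titanic_train$Parch))

submission <- data.frame(
  PassengerId = test_ids,
  Survived    = predict(rf_model, newdata = titanic_test))
write.csv(submission, "submission.csv", row.names = FALSE)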