The decision tree analyzed here comes from a Kaggle competition whose goal is to predict which passengers survived the Titanic shipwreck. The dataset is split into a training set and a test set. Work began with a close look at the variable summaries to understand the nature of the data: identifying missing values, determining whether each variable was continuous or categorical, and devising strategies for handling the missing data.
One such strategy imputed the missing "Embarked" values with the highest-frequency category, "S". Similarly, the missing fare in the "titanic_test" set was replaced with the mean fare, fare being continuous. Missing ages called for a more nuanced approach: through feature engineering, the titles embedded in passengers' names ("Miss.", "Mrs.", and so on) were extracted, the average age per title was computed, and those averages filled in the missing values.
To streamline the analysis, variables deemed irrelevant were dropped. The "Cabin" variable was discarded because of its high proportion of missing values, while "Ticket" and "PassengerId" were excluded as identifiers. The remaining categorical variables were recoded numerically (Sex as a binary indicator, Embarked as an integer code) to facilitate construction of the decision tree and random forest models.
Following preprocessing, a correlation matrix was computed to check for collinearity among the variables, since collinearity can degrade the performance of predictive models. With no strong collinearity found, attention shifted to the most influential variables: sex and passenger class emerged as pivotal predictors of survival, showing the strongest correlations with the outcome.
To assess model performance, two decision trees were built: one restricted to Pclass and Age (omitting Sex, the single most important variable) and one incorporating all variables. Comparing them gives a rough read on bias and variance. The accuracy gap of 0.09022556 shows the full model consistently outperformed the restricted one on the held-out data. Note that the "variance" computed below, the sum of the two models' absolute deviations from their mean accuracy, necessarily equals that same gap when only two models are compared, so it restates the disparity rather than measuring independent variability.
Subsequently, a random forest model was constructed and achieved the highest accuracy of all models, 0.8947 on the held-out split. One caveat: the forest was fit on the full titanic_train data, which contains that split, so the out-of-bag error estimate of roughly 17% is the more trustworthy gauge of generalization. Random forests are known for their robustness and ability to handle complex datasets, making them well suited to tasks such as survival prediction on the Titanic data.
The analysis of the Titanic dataset with decision tree and random forest models highlights their efficacy, but also the practical concerns that come with using them. For complex problems such as predicting passenger survival, decision trees can grow large and unwieldy, which complicates maintenance and version control. Tooling that places tree models under version control helps teams collaborate and keeps models evolving in an orderly way.
In conclusion, the analysis of the Titanic dataset showcased the effectiveness of decision tree and random forest models in predicting passenger survival. By leveraging feature engineering, preprocessing techniques, and model evaluation strategies, valuable insights were gleaned regarding the influential factors determining survival outcomes. These findings underscore the importance of robust data analysis methodologies in deriving actionable insights from complex datasets.
Data source: https://www.kaggle.com/competitions/titanic/data?select=test.csv
# Convert integer columns to factors and recode empty strings as NA
titanic_train <- titanic_train %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~factor(replace(., . == "", NA)))
titanic_test <- titanic_test %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.character, ~factor(replace(., . == "", NA)))
summary(titanic_train)
## PassengerId Survived Pclass Name
## 1 : 1 0:549 1:216 Abbing, Mr. Anthony : 1
## 2 : 1 1:342 2:184 Abbott, Mr. Rossmore Edward : 1
## 3 : 1 3:491 Abbott, Mrs. Stanton (Rosa Hunt) : 1
## 4 : 1 Abelson, Mr. Samuel : 1
## 5 : 1 Abelson, Mrs. Samuel (Hannah Wizosky): 1
## 6 : 1 Adahl, Mr. Mauritz Nils Martin : 1
## (Other):885 (Other) :885
## Sex Age SibSp Parch Ticket Fare
## female:314 Min. : 0.42 0:608 0:678 1601 : 7 Min. : 0.00
## male :577 1st Qu.:20.12 1:209 1:118 347082 : 7 1st Qu.: 7.91
## Median :28.00 2: 28 2: 80 CA. 2343: 7 Median : 14.45
## Mean :29.70 3: 16 3: 5 3101295 : 6 Mean : 32.20
## 3rd Qu.:38.00 4: 18 4: 4 347088 : 6 3rd Qu.: 31.00
## Max. :80.00 5: 5 5: 5 CA 2144 : 6 Max. :512.33
## NA's :177 8: 7 6: 1 (Other) :852
## Cabin Embarked
## B96 B98 : 4 C :168
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3 NA's: 2
## D : 3
## (Other) :186
## NA's :687
summary(titanic_test)
## PassengerId Pclass Name
## 892 : 1 1:107 Abbott, Master. Eugene Joseph : 1
## 893 : 1 2: 93 Abelseth, Miss. Karen Marie : 1
## 894 : 1 3:218 Abelseth, Mr. Olaus Jorgensen : 1
## 895 : 1 Abrahamsson, Mr. Abraham August Johannes : 1
## 896 : 1 Abrahim, Mrs. Joseph (Sophie Halaut Easu): 1
## 897 : 1 Aks, Master. Philip Frank : 1
## (Other):412 (Other) :412
## Sex Age SibSp Parch Ticket
## female:152 Min. : 0.17 0:283 0 :324 PC 17608: 5
## male :266 1st Qu.:21.00 1:110 1 : 52 113503 : 4
## Median :27.00 2: 14 2 : 33 CA. 2343: 4
## Mean :30.27 3: 4 3 : 3 16966 : 3
## 3rd Qu.:39.00 4: 4 4 : 2 220845 : 3
## Max. :76.00 5: 1 9 : 2 347077 : 3
## NA's :86 8: 2 (Other): 2 (Other) :396
## Fare Cabin Embarked
## Min. : 0.000 B57 B59 B63 B66: 3 C:102
## 1st Qu.: 7.896 A34 : 2 Q: 46
## Median : 14.454 B45 : 2 S:270
## Mean : 35.627 C101 : 2
## 3rd Qu.: 31.500 C116 : 2
## Max. :512.329 (Other) : 80
## NA's :1 NA's :327
Imputing missing values: Embarked:
ggplot(titanic_train, aes(x = factor(Embarked))) +
  geom_bar() +
  labs(title = "Embarked Count", x = "Embarked", y = "Count")
Since most passengers embarked from "S" and only 2 Embarked values are NA, those NAs will be replaced with "S".
titanic_train$Embarked <- replace_na(titanic_train$Embarked, "S")
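As an aside, a slightly more general version of this fill computes the modal category instead of hardcoding "S"; a minimal sketch (equivalent here, but robust to data changes):
# Sketch: impute Embarked with its most frequent category
mode_embarked <- names(which.max(table(titanic_train$Embarked)))
titanic_train$Embarked <- replace_na(titanic_train$Embarked, mode_embarked)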
Because only a single value is missing from the test data, in the Fare column, it is imputed with the mean fare amount:
titanic_test <- titanic_test %>%
  mutate(Fare = if_else(is.na(Fare), mean(Fare, na.rm = TRUE), Fare))
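Since the fare distribution is strongly right-skewed (see the Fare histogram further below), the median is arguably a more robust fill value; a sketch of that alternative:
# Alternative sketch: median imputation is less sensitive to extreme fares
titanic_test <- titanic_test %>%
  mutate(Fare = if_else(is.na(Fare), median(Fare, na.rm = TRUE), Fare))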
The Cabin column is dropped because too many of its values are missing, and the main thing it would reveal, financial class, is already captured by Pclass.
How to determine Age, though? It can be inferred from the name: the title embedded in each name groups passengers into rough age ranges.
# Extract each passenger's title (the first word ending in a period) from 'Name'
titanic_train$Initial <- str_extract(titanic_train$Name, "[A-Za-z]+\\.")
# Fall back to "Unknown" for any name the pattern fails to match
titanic_train$Initial[is.na(titanic_train$Initial)] <- "Unknown"
ggplot(titanic_train, aes(x = Initial, fill = factor(Sex))) +
  geom_bar(position = "dodge") +
  labs(title = "Initial: Sex", x = "Initial", y = "Count") +
  scale_fill_manual(values = c("male" = "skyblue", "female" = "salmon"))
# Calculate the count, age range, and mean age for each title
initial_age_range <- titanic_train %>%
  group_by(Initial) %>%
  summarize(Count = n(),
            Min_Age = min(Age, na.rm = TRUE),
            Max_Age = max(Age, na.rm = TRUE),
            Average = mean(Age, na.rm = TRUE))
print(initial_age_range)
## # A tibble: 17 × 5
## Initial Count Min_Age Max_Age Average
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Capt. 1 70 70 70
## 2 Col. 2 56 60 58
## 3 Countess. 1 33 33 33
## 4 Don. 1 40 40 40
## 5 Dr. 7 23 54 42
## 6 Jonkheer. 1 38 38 38
## 7 Lady. 1 48 48 48
## 8 Major. 2 45 52 48.5
## 9 Master. 40 0.42 12 4.57
## 10 Miss. 182 0.75 63 21.8
## 11 Mlle. 2 24 24 24
## 12 Mme. 1 24 24 24
## 13 Mr. 517 11 80 32.4
## 14 Mrs. 125 14 63 35.9
## 15 Ms. 1 28 28 28
## 16 Rev. 6 27 57 43.2
## 17 Sir. 1 49 49 49
# Consolidate the rare titles into the main categories
titanic_train <- titanic_train %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Mlle.", "Mme.", "Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Countess.", "Mrs.", "Lady.") ~ "Mrs.",
    Initial %in% c("Don.", "Sir.", "Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))
titanic_train %>% group_by(Initial) %>%
  summarize(Count = n(),
            Average = mean(Age, na.rm = TRUE))
## # A tibble: 5 × 3
## Initial Count Average
## <chr> <int> <dbl>
## 1 Master. 40 4.57
## 2 Miss. 186 21.9
## 3 Mr. 519 32.4
## 4 Mrs. 127 36.0
## 5 Other 19 46.2
# Impute each missing Age with the mean age of its title group
titanic_train <- titanic_train %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
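The case_when ladder above is explicit but repetitive. A grouped mutate expresses the same per-title mean imputation more compactly; a minimal sketch of the equivalent formulation:
# Equivalent sketch: impute each missing Age with its title-group mean
titanic_train <- titanic_train %>%
  group_by(Initial) %>%
  mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age)) %>%
  ungroup()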
Do the same thing with the test set:
# Extract each passenger's title from 'Name', mirroring the training set
titanic_test$Initial <- str_extract(titanic_test$Name, "[A-Za-z]+\\.")
# Fall back to "Unknown" for any name the pattern fails to match
titanic_test$Initial[is.na(titanic_test$Initial)] <- "Unknown"
ggplot(titanic_test, aes(x = Initial, fill = factor(Sex))) +
  geom_bar(position = "dodge") +
  labs(title = "Initial: Sex", x = "Initial", y = "Count") +
  scale_fill_manual(values = c("male" = "skyblue", "female" = "salmon"))
# Consolidate titles in the test set (which has "Dona." instead of some training-set titles)
titanic_test <- titanic_test %>%
  mutate(Initial = case_when(
    Initial %in% c("Ms.", "Miss.") ~ "Miss.",
    Initial %in% c("Master.") ~ "Master.",
    Initial %in% c("Dona.", "Mrs.") ~ "Mrs.",
    Initial %in% c("Mr.") ~ "Mr.",
    TRUE ~ "Other"
  ))
# Impute missing test-set ages with the mean age of each title group
titanic_test <- titanic_test %>%
  mutate(Age = case_when(
    Initial == "Master." ~ if_else(is.na(Age), mean(Age[Initial == "Master."], na.rm = TRUE), Age),
    Initial == "Miss." ~ if_else(is.na(Age), mean(Age[Initial == "Miss."], na.rm = TRUE), Age),
    Initial == "Mr." ~ if_else(is.na(Age), mean(Age[Initial == "Mr."], na.rm = TRUE), Age),
    Initial == "Mrs." ~ if_else(is.na(Age), mean(Age[Initial == "Mrs."], na.rm = TRUE), Age),
    Initial == "Other" ~ if_else(is.na(Age), mean(Age[Initial == "Other"], na.rm = TRUE), Age),
    TRUE ~ Age
  ))
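One caveat: the title means above come from the test set itself. A more careful variant would fill test-set ages from the training-set title means, so the test data never inform their own preprocessing. A sketch (it assumes running before the Initial columns are dropped below):
# Sketch: impute test-set ages from the training-set title means
train_means <- titanic_train %>%
  group_by(Initial) %>%
  summarize(MeanAge = mean(Age, na.rm = TRUE))
titanic_test <- titanic_test %>%
  left_join(train_means, by = "Initial") %>%
  mutate(Age = if_else(is.na(Age), MeanAge, Age)) %>%
  select(-MeanAge)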
# Drop identifier and high-missingness columns; recode Sex as a binary
# indicator and Embarked as an integer code
titanic_train <- titanic_train %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  ))
titanic_test <- titanic_test %>%
  select(-c(Cabin, Ticket, Name, PassengerId, Initial)) %>%
  mutate(Sex = if_else(Sex == "female", 1, 0)) %>%
  mutate(Embarked = case_when(
    Embarked == "C" ~ 0,
    Embarked == "Q" ~ 1,
    Embarked == "S" ~ 2,
    TRUE ~ as.numeric(Embarked)
  ))
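Note that this leaves Embarked as an ordinal 0/1/2 code rather than true dummy variables; tree-based models tolerate that, but a genuine one-hot encoding of the recoded column is easy to produce if wanted. A sketch (not applied in the pipeline above):
# Sketch: one-hot (dummy) encoding of Embarked via model.matrix
embarked_dummies <- model.matrix(~ factor(Embarked) - 1, data = titanic_train)
head(embarked_dummies)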
# Coerce every column to numeric so a correlation matrix can be computed
titanic_train_numeric <- as.data.frame(sapply(titanic_train, as.numeric))
cor_matrix <- cor(titanic_train_numeric)
corrplot(cor_matrix,
         method = "color",
         addCoef.col = "black",
         type = "upper")
Looking at the important variables to decide how to structure the decision tree:
# Distribution of the outcome variable
ggplot(titanic_train, aes(x = factor(Survived))) +
  geom_bar(aes(fill = factor(Survived)), position = "dodge") +
  labs(title = "Survived Histogram", x = "Survived (0 = No, 1 = Yes)", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
ggplot(titanic_train, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Sex: Survived = 1, Dead = 0", x = "Sex", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
ggplot(titanic_train, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Pclass: Survived = 1, Dead = 0", x = "Pclass", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
ggplot(titanic_train, aes(x = SibSp, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "SibSp: Survived = 1, Dead = 0", x = "SibSp", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
ggplot(titanic_train, aes(x = Parch, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Parch: Survived = 1, Dead = 0", x = "Parch", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
# (Fixed: this plot originally reused x = Parch; the title and fill keys show Embarked vs. Survived was intended)
ggplot(titanic_train, aes(x = factor(Embarked), fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(title = "Embarked: Survived = 1, Dead = 0", x = "Embarked", y = "Count") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon"))
ggplot(titanic_train, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency")
ggplot(titanic_train, aes(x = Fare)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Fare", x = "Fare", y = "Frequency")
# Split the titanic_train dataset into training and testing sets
set.seed(123) # for reproducibility
train_index <- createDataPartition(titanic_train$Survived, p = 0.7, list = FALSE)
train_data <- titanic_train[train_index, ]
test_data <- titanic_train[-train_index, ]
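createDataPartition stratifies on the outcome; a quick check can confirm the split preserved the class balance. A minimal sketch:
# Sketch: verify the 70/30 split preserved the survival rate in both pieces
prop.table(table(train_data$Survived))
prop.table(table(test_data$Survived))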
# Model 1: a deliberately restricted tree using only Pclass and Age (Sex omitted)
tree_model_1 <- rpart(Survived ~ Pclass + Age, data = train_data, method = "class")
rpart.plot(tree_model_1)
importances <- tree_model_1$variable.importance
importances <- data.frame(Feature = names(importances), Importance = importances)
# Keep at most the top 6 features; head() avoids the NA padding (and the
# resulting ggplot warning) that [1:6, ] produced when fewer features exist
importances <- head(importances[order(importances$Importance, decreasing = TRUE), ], 6)
ggplot(importances, aes(x = Importance, y = Feature)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Feature Importances", x = "Importance", y = "Feature") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20, face = "bold"),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14))
test_predictions1 <- predict(tree_model_1, newdata = test_data, type = "class")
test_accuracy1 <- mean(test_predictions1 == test_data$Survived)
cat("Test Accuracy:", test_accuracy1, "\n")
## Test Accuracy: 0.6917293
# Model 2: a full tree using all predictors
tree_model_2 <- rpart(Survived ~ ., data = train_data, method = "class")
rpart.plot(tree_model_2)
importances <- tree_model_2$variable.importance
importances <- data.frame(Feature = names(importances), Importance = importances)
importances <- head(importances[order(importances$Importance, decreasing = TRUE), ], 6)
ggplot(importances, aes(x = Importance, y = Feature)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Feature Importances", x = "Importance", y = "Feature")
test_predictions2 <- predict(tree_model_2, newdata = test_data, type = "class")
test_accuracy2 <- mean(test_predictions2 == test_data$Survived)
cat("Test Accuracy:", test_accuracy2, "\n")
## Test Accuracy: 0.7819549
# test_accuracy1 and test_accuracy2 were computed above and are reused here
# Calculate the absolute difference in accuracy between the two models
bias_difference <- abs(test_accuracy1 - test_accuracy2)
cat("Bias difference between the two models:", bias_difference, "\n")
## Bias difference between the two models: 0.09022556
# Calculate the spread of the two accuracies around their mean
mean_accuracy <- (test_accuracy1 + test_accuracy2) / 2
variance <- abs(test_accuracy1 - mean_accuracy) + abs(test_accuracy2 - mean_accuracy)
# Note: with exactly two models this sum of absolute deviations equals the
# accuracy difference above, so it carries the same information
cat("Variance of the two models:", variance, "\n")
## Variance of the two models: 0.09022556
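A more conventional way to gauge variance is to watch how a model's accuracy fluctuates across resampled splits. A minimal sketch under the same setup (rpart and caret already loaded):
# Sketch: accuracy of the full tree across five random 70/30 splits
set.seed(123)
accs <- replicate(5, {
  idx <- createDataPartition(titanic_train$Survived, p = 0.7, list = FALSE)
  fit <- rpart(Survived ~ ., data = titanic_train[idx, ], method = "class")
  preds <- predict(fit, newdata = titanic_train[-idx, ], type = "class")
  mean(preds == titanic_train[-idx, ]$Survived)
})
mean(accs)
var(accs)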
set.seed(1234)
rf_model <- randomForest(Survived ~ ., data = titanic_train)
rf_model
##
## Call:
## randomForest(formula = Survived ~ ., data = titanic_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 16.95%
## Confusion matrix:
## 0 1 class.error
## 0 503 46 0.08378871
## 1 105 237 0.30701754
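The forest's own importance measure can also be inspected directly; a one-line sketch using randomForest's built-in plot:
# Sketch: plot each predictor's mean decrease in Gini
varImpPlot(rf_model, main = "Random Forest Variable Importance")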
test_predictionsrf <- predict(rf_model, newdata = test_data)
# Confusion matrix. Caveat: rf_model was trained on all of titanic_train,
# which contains test_data, so this accuracy is optimistic; the OOB error
# above is the unbiased estimate.
(conf_matrix <- confusionMatrix(test_predictionsrf, test_data$Survived))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 160 24
## 1 4 78
##
## Accuracy : 0.8947
## 95% CI : (0.8515, 0.9289)
## No Information Rate : 0.6165
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7688
##
## Mcnemar's Test P-Value : 0.0003298
##
## Sensitivity : 0.9756
## Specificity : 0.7647
## Pos Pred Value : 0.8696
## Neg Pred Value : 0.9512
## Prevalence : 0.6165
## Detection Rate : 0.6015
## Detection Prevalence : 0.6917
## Balanced Accuracy : 0.8702
##
## 'Positive' Class : 0
##
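Finally, to close the loop with the competition itself, the fitted forest can score the real Kaggle test set and write a submission file. A sketch, assuming the test PassengerId column was saved beforehand (test_ids below is hypothetical, since PassengerId was dropped during preprocessing):
# Sketch: score the Kaggle test set and write a submission file
# (test_ids is assumed to hold the PassengerId column saved earlier)
kaggle_preds <- predict(rf_model, newdata = titanic_test)
submission <- data.frame(PassengerId = test_ids, Survived = kaggle_preds)
write.csv(submission, "submission.csv", row.names = FALSE)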