The Titanic dataset analysis aims to predict the likelihood of survival using various features. The dataset contains information such as passenger class (Pclass), sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket information, fare, cabin, and port of embarkation (Embarked).
First, missing values were identified. The Age column contained 177 missing values, which were filled with the median age; the median is less sensitive to extreme ages than the mean, so the imputation distorts the distribution less. The categorical columns Survived, Pclass, Sex, and Embarked were converted to factors so that the models treat them as categories rather than as numbers or free text.
Various visualizations were generated for numeric and categorical data to understand distributions and potential relationships:
Histograms and Boxplots for Numeric Columns: These visualizations helped identify the spread and distribution of variables such as Age, Fare, and the number of family members (SibSp, Parch). The boxplots provided insights into the presence of outliers, which could potentially impact the model’s performance.
Bar Plots for Categorical Columns: These plots displayed the count of each unique value in the categorical columns (e.g., Embarked, Sex), allowing an understanding of data distribution and class imbalances.
Three models were created to predict survival: two decision trees with different feature sets and a random forest model.
Decision Tree Models:
First Decision Tree: Used Pclass, Sex, Age, and Fare as predictors and reached a test accuracy of 80.08%.
Second Decision Tree: Replaced Fare with the family features SibSp and Parch and reached a lower test accuracy of 75.56%.
Both models were evaluated on the held-out test set using confusion matrices and accuracy. The first model performed better. Because the second model drops Fare as well as adding SibSp and Parch, the gap may reflect the loss of Fare, overfitting to the added features, or both; this comparison alone cannot separate the two.
Random Forest Model:
A random forest model was built using a broader set of predictors: Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. This model achieved a test accuracy of 80.08%, identical to the first decision tree and with the same test confusion matrix. Random forests generally provide robust performance by aggregating the votes of many decision trees, which helps reduce overfitting and improve generalizability.
Titanic dataset
df <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Project%202/Titanic-Dataset.csv")
head(df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
missing_values <- colSums(is.na(df))
print(missing_values)
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
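The NA count above reports zero for Cabin, yet the `str(df)` output shows empty strings (`""`) in that column. `is.na()` does not treat a blank string as missing, so blanks need to be counted separately. The following is a minimal sketch on a small, hypothetical data frame (`toy` and its values are made up for illustration, not taken from the dataset):

```r
# Toy data frame (hypothetical values) mimicking how read.csv() loads blank text cells
toy <- data.frame(
  Age      = c(22, NA, 26, 35),
  Cabin    = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so blank strings go uncounted
na_counts <- colSums(is.na(toy))

# Count blank strings separately in the character columns
blank_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

print(na_counts)
print(blank_counts)
```

An alternative, if blanks should be treated as missing from the start, is to pass `na.strings = c("", "NA")` to `read.csv()` so they are loaded as `NA`.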
Making the data more readable by converting the categorical columns to factors
df$Survived <- factor(df$Survived)
df$Pclass <- factor(df$Pclass)
df$Sex <- factor(df$Sex)
df$Embarked <- factor(df$Embarked)
Looking at the boxplots and histograms of the numeric columns
# Loop for numeric columns
for (i in colnames(df)) {
if (is.numeric(df[[i]])) {
hist(df[[i]], main = paste("Histogram of", i), xlab = i)
boxplot(df[[i]], main = paste("Boxplot of", i), ylab = i)
}
}
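# Categorical (factor) columns
# Sex and Embarked were converted to factors above, so is.character()
# would only match free-text columns such as Name and Ticket
for (i in colnames(df)) {
  if (is.factor(df[[i]])) {
    counts <- table(df[[i]])
    barplot(counts, main = paste("Bar Plot of", i), xlab = i, ylab = "Count", las = 2)
  }
}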
# Splitting the data into training and testing sets
library(caret)  # provides createDataPartition()
set.seed(123)
# Stratified 70/30 split on the outcome
trainIndex <- createDataPartition(df$Survived, p = .7, list = FALSE)
train_set <- df[ trainIndex,]
test_set <- df[-trainIndex,]
Creating the decision trees to predict survival
library(rpart)       # decision trees
library(rpart.plot)  # tree plotting
tree_model1 <- rpart(Survived ~ Pclass + Sex + Age + Fare,
                     data = train_set, method = "class")
rpart.plot(tree_model1)
predictions1 <- predict(tree_model1, test_set, type = "class")
# Confusion matrix
confusion_matrix1 <- table(test_set$Survived, predictions1)
accuracy1 <- sum(diag(confusion_matrix1)) / sum(confusion_matrix1)
# Print accuracy for the first model
confusion_matrix1
## predictions1
## 0 1
## 0 146 18
## 1 35 67
print(paste("Accuracy of first decision tree:", round(accuracy1 * 100, 2), "%"))
## [1] "Accuracy of first decision tree: 80.08 %"
# Second tree: drops Fare and adds SibSp and Parch
tree_model2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch,
                     data = train_set, method = "class")
rpart.plot(tree_model2)
# Prediction using test set
predictions2 <- predict(tree_model2, test_set, type = "class")
# Confusion matrix
confusion_matrix2 <- table(test_set$Survived, predictions2)
accuracy2 <- sum(diag(confusion_matrix2)) / sum(confusion_matrix2)
confusion_matrix2
## predictions2
## 0 1
## 0 134 30
## 1 35 67
print(paste("Accuracy of second decision tree:", round(accuracy2 * 100, 2), "%"))
## [1] "Accuracy of second decision tree: 75.56 %"
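Accuracy alone hides where the two trees differ. As a rough check, the confusion matrices printed above can be re-entered by hand to compute sensitivity and specificity, treating class 1 (survived) as the positive class. `class_metrics` is a hypothetical helper written for this sketch, not part of any package:

```r
# Confusion matrices copied from the output above (rows = actual, columns = predicted)
cm1 <- matrix(c(146, 18, 35, 67), nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
cm2 <- matrix(c(134, 30, 35, 67), nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Hypothetical helper: accuracy, sensitivity and specificity with "1" as the positive class
class_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

metrics <- rbind(tree1 = class_metrics(cm1), tree2 = class_metrics(cm2))
print(round(metrics, 4))
```

Both trees correctly identify the same number of survivors (67 of 102), so the accuracy gap comes entirely from the second tree wrongly predicting survival for more non-survivors (30 versus 18).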
Creating the random forest model
library(randomForest)
set.seed(123)
# importance = TRUE stores variable importance measures for varImpPlot()
rf_model <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                         data = train_set, importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train_set, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 15.52%
## Confusion matrix:
## 0 1 class.error
## 0 351 34 0.08831169
## 1 63 177 0.26250000
# Prediction on the test set
rf_predictions <- predict(rf_model, test_set)
# Confusion matrix
rf_confusion_matrix <- table(test_set$Survived, rf_predictions)
rf_accuracy <- sum(diag(rf_confusion_matrix)) / sum(rf_confusion_matrix)
rf_confusion_matrix
## rf_predictions
## 0 1
## 0 146 18
## 1 35 67
print(paste("Accuracy of random forest model:", round(rf_accuracy * 100, 2), "%"))
## [1] "Accuracy of random forest model: 80.08 %"
varImpPlot(rf_model)
To address some of the concerns mentioned in the blog, I took a few important steps. First, I tackled the issue of overfitting by using a random forest model instead of a single decision tree. Single decision trees often fit too closely to the training data, which can lead to poor performance on new data. In contrast, a random forest grows many decision trees on bootstrap samples and aggregates their votes, which mainly reduces variance. The model's out-of-bag (OOB) error estimate was 15.52%. That estimate is computed from the training rows each tree did not see, so it is not the same quantity as the test-set result: on the held-out test set the random forest reached 80.08% accuracy, matching the first decision tree rather than beating it.
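The 15.52% figure comes from the out-of-bag confusion matrix printed by `print(rf_model)`, while the 80.08% accuracy comes from the test-set confusion matrix. A short sketch that recomputes both error rates from the counts reported above makes the difference explicit (`error_rate` is a hypothetical helper for this illustration):

```r
# Counts copied from the random forest output above
oob_cm  <- matrix(c(351, 34, 63, 177), nrow = 2, byrow = TRUE)  # out-of-bag, training rows
test_cm <- matrix(c(146, 18, 35, 67),  nrow = 2, byrow = TRUE)  # held-out test rows

# Hypothetical helper: share of off-diagonal (misclassified) cases
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

oob_error  <- error_rate(oob_cm)   # 97 misclassified out of 625
test_error <- error_rate(test_cm)  # 53 misclassified out of 266

print(round(c(oob = oob_error, test = test_error) * 100, 2))
```

The two numbers describe different samples, so the lower OOB error should not be read as an improvement over the decision tree's test accuracy.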
Second, I paid attention to feature importance by setting importance = TRUE in the random forest model. This allowed me to see, through varImpPlot(), which features contributed most to the predictions. Understanding these influential features is essential for refining the model and making more informed decisions on feature selection and engineering.
Finally, I was careful to prevent data leakage. This meant ensuring that no features included during training could directly or indirectly reveal the target outcome of whether a passenger survived. By carefully selecting features, I avoided biases that could give the model an unrealistically high accuracy. This step helped keep my model’s predictions more reliable and reflective of real-world performance.
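One place where a mild form of leakage can still creep in is imputation: in the code above, the median age is computed on all 891 rows before the train/test split, so the fill value is partly informed by test rows. The effect is probably small here, but a stricter variant learns the median from the training rows only. The sketch below uses a small hypothetical data frame (`toy`, `train_rows`, and the values are made up) rather than the real `df` and `trainIndex`:

```r
# Toy data (hypothetical values); the real analysis would use df and trainIndex
set.seed(123)
toy <- data.frame(Age = c(22, NA, 26, 35, NA, 54, 2, 27, 14, 40))
train_rows <- sample(seq_len(nrow(toy)), size = 7)

train_toy <- toy[train_rows, , drop = FALSE]
test_toy  <- toy[-train_rows, , drop = FALSE]

# Learn the fill value from the training rows only
train_median <- median(train_toy$Age, na.rm = TRUE)

# Apply that same training-derived value to both subsets
train_toy$Age[is.na(train_toy$Age)] <- train_median
test_toy$Age[is.na(test_toy$Age)]   <- train_median

print(train_median)
```

In the actual workflow this would mean moving the imputation step to after the `createDataPartition()` split.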
The analysis showed that the model’s performance depends significantly on the chosen features: swapping Fare for SibSp and Parch lowered the decision tree's test accuracy from 80.08% to 75.56%. The random forest model matched the best decision tree on the test set and is, in principle, more robust against overfitting, although a single train/test split is not enough to demonstrate that. The visual analysis provided insights into the data distribution, while the predictive models helped identify key survival indicators, such as Sex and Pclass. Future improvements could include fine-tuning hyperparameters, implementing cross-validation, and experimenting with additional feature engineering to enhance model accuracy and reliability.
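As one possible starting point for the cross-validation mentioned above, here is a minimal sketch of manual 5-fold cross-validation with `rpart`. It runs on synthetic stand-in data (the `toy` data frame, its probabilities, and the choice of `k = 5` are all assumptions for illustration), so the printed accuracies say nothing about the Titanic data itself:

```r
library(rpart)

# Synthetic stand-in data (hypothetical); swap in the prepared Titanic data frame
set.seed(123)
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
# Outcome loosely tied to Sex so the tree has some signal to find
p_survive <- ifelse(toy$Sex == "female", 0.75, 0.2)
toy$Survived <- factor(rbinom(n, 1, p_survive), levels = c(0, 1))

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))  # random, equally sized folds

# Fit on k - 1 folds, score on the held-out fold, repeat for each fold
fold_accuracy <- sapply(seq_len(k), function(fold) {
  train_fold <- toy[fold_id != fold, ]
  test_fold  <- toy[fold_id == fold, ]
  fit  <- rpart(Survived ~ Pclass + Sex + Age, data = train_fold, method = "class")
  pred <- predict(fit, test_fold, type = "class")
  mean(pred == test_fold$Survived)
})

print(round(fold_accuracy, 3))
print(round(mean(fold_accuracy), 3))
```

Averaging over folds gives a steadier accuracy estimate than a single 70/30 split. Since the analysis already loads caret, its `trainControl(method = "cv", number = 5)` with `train()` would be another way to get the same kind of estimate.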