Analysis of the Titanic Dataset

The Titanic dataset analysis aims to predict the likelihood of survival using various features. The dataset contains information such as passenger class (Pclass), sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), ticket information, fare, cabin, and port of embarkation (Embarked).

Data Cleaning and Preparation:

Initially, missing values were identified. The Age column contained 177 missing values, which were filled with the median age because the median is less sensitive to outliers than the mean. Blank Cabin entries are stored as empty strings rather than NA, so they do not appear in this count. Categorical columns such as Survived, Pclass, Sex, and Embarked were converted to factors so that the models treat them as categorical variables rather than as numbers or free text.

Exploratory Data Analysis (EDA):

Various visualizations were generated for numeric and categorical data to understand distributions and potential relationships:

Histograms and Boxplots for Numeric Columns: These visualizations helped identify the spread and distribution of variables such as Age, Fare, and the number of family members (SibSp, Parch). The boxplots provided insights into the presence of outliers, which could potentially impact the model’s performance.

Bar Plots for Categorical Columns: These plots displayed the count of each unique value in the categorical columns (e.g., Embarked, Sex), allowing an understanding of data distribution and class imbalances.

Model Development:

Three models were created to predict survival: two decision trees with different feature sets and a random forest model.

Decision Tree Models:

First Decision Tree: Used four predictors (Pclass, Sex, Age, Fare). Its accuracy on the test set was approximately 80.08%.

Second Decision Tree: Replaced Fare with the family features SibSp and Parch (Pclass, Sex, Age, SibSp, Parch) and achieved a lower accuracy of 75.56%.

Both models were evaluated on the held-out test set using confusion matrices and accuracy calculations. The first model performed better, which suggests that Fare carries more predictive signal than SibSp and Parch on this split, or that the second tree overfit the training data.

Random Forest Model:

A random forest model was built using a broader set of predictors: Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. This model achieved a test accuracy of 80.08%, the same as the first decision tree. Random forests generally provide robust performance by combining the votes of many decision trees, which helps reduce overfitting and improve generalizability.

Titanic dataset

df <- read.csv("https://raw.githubusercontent.com/Mikhail-Broomes/Data-622-/refs/heads/main/Project%202/Titanic-Dataset.csv")

head(df)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

str(df)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Fixing missing data

missing_values <- colSums(is.na(df))
print(missing_values)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
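Note that is.na() only counts true NA values; the blank Cabin entries visible in the head(df) output are empty strings, which is why Cabin is reported with 0 missing. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Titanic data) to show one way to count both kinds of gaps before applying the same median imputation:

```r
# Hypothetical toy data standing in for df; the values are made up for illustration.
toy <- data.frame(
  Age   = c(22, 38, NA, 35, NA, 54),
  Cabin = c("", "C85", "", "C123", "", "E46"),
  stringsAsFactors = FALSE
)

# Count NA values and empty strings separately for each column.
na_counts    <- colSums(is.na(toy))
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

# Median imputation for Age, as in the analysis above.
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

print(na_counts)
print(blank_counts)
print(toy$Age)
```

Running the same two counts on df would show how many Cabin and Embarked values are blank, which matters if either column is later used as a predictor.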

Exploratory Data Analysis

Making the data more readable by converting the categorical columns to factors

df$Survived <- factor(df$Survived)
df$Pclass <- factor(df$Pclass)
df$Sex <- factor(df$Sex)
df$Embarked <- factor(df$Embarked)

Looking at the boxplots and histograms of the numeric columns

# Loop for numeric columns
for (i in colnames(df)) {
  if (is.numeric(df[[i]])) {
    hist(df[[i]], main = paste("Histogram of", i), xlab = i)
    boxplot(df[[i]], main = paste("Boxplot of", i), ylab = i)
  }
}

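# Categorical columns
# Survived, Pclass, Sex, and Embarked were converted to factors above, so test
# with is.factor(); is.character() would only match Name, Ticket, and Cabin.
for (i in colnames(df)) {
  if (is.factor(df[[i]])) {
    counts <- table(df[[i]])
    barplot(counts, main = paste("Bar Plot of", i), xlab = i, ylab = "Count", las = 2)
  }
}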

Training sets

library(caret)  # provides createDataPartition()

# Splitting the data into training (70%) and testing (30%) sets, stratified by Survived
set.seed(123)
trainIndex <- createDataPartition(df$Survived, p = .7, list = FALSE)
train_set <- df[ trainIndex,]
test_set  <- df[-trainIndex,]
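createDataPartition() samples within each level of the outcome, so both sets keep roughly the same survival rate. As a rough base-R illustration of that idea (not caret's actual implementation; `stratified_split` and the toy labels are hypothetical), one could write:

```r
# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(123)
# Toy outcome with the class counts implied by the confusion matrices
# in this report (549 non-survivors, 342 survivors).
y <- factor(rep(c(0, 1), times = c(549, 342)))

train_idx <- stratified_split(y, p = 0.7)

# Class proportions should be nearly identical in both parts.
print(prop.table(table(y[train_idx])))
print(prop.table(table(y[-train_idx])))
```

The exact row counts can differ slightly from caret's because the rounding rule here is an assumption.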

Creating the decision trees to predict survival

library(rpart)       # decision trees
library(rpart.plot)  # tree plotting

tree_model1 <- rpart(Survived ~ Pclass + Sex + Age + Fare, 
                      data = train_set, method = "class")

rpart.plot(tree_model1)

predictions1 <- predict(tree_model1, test_set, type = "class")

# Confusion matrix (rows = actual, columns = predicted)
confusion_matrix1 <- table(test_set$Survived, predictions1)
accuracy1 <- sum(diag(confusion_matrix1)) / sum(confusion_matrix1)

# Print the confusion matrix for the first model
confusion_matrix1

##    predictions1
##       0   1
##   0 146  18
##   1  35  67

print(paste("Accuracy of first decision tree:", round(accuracy1 * 100, 2), "%"))

## [1] "Accuracy of first decision tree: 80.08 %"

# Second tree: Fare is dropped and SibSp and Parch are added
tree_model2 <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch, 
                      data = train_set, method = "class")

rpart.plot(tree_model2)

# Prediction using test set
predictions2 <- predict(tree_model2, test_set, type = "class")

# Confusion matrix 
confusion_matrix2 <- table(test_set$Survived, predictions2)
accuracy2 <- sum(diag(confusion_matrix2)) / sum(confusion_matrix2)

confusion_matrix2

##    predictions2
##       0   1
##   0 134  30
##   1  35  67

print(paste("Accuracy of second decision tree:", round(accuracy2 * 100, 2), "%"))

## [1] "Accuracy of second decision tree: 75.56 %"
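Accuracy alone hides where the two trees differ. As a small sketch, the matrices printed above can be re-entered by hand to compute sensitivity, specificity, and precision (`class_metrics` is a hypothetical helper, and class 1, survived, is treated as the positive class):

```r
# Hypothetical helper; expects rows = actual, columns = predicted, levels "0" and "1".
class_metrics <- function(cm) {
  tp <- cm["1", "1"]; tn <- cm["0", "0"]
  fp <- cm["0", "1"]; fn <- cm["1", "0"]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Counts copied from the two confusion matrices printed above
# (matrix() fills column by column).
cm1 <- matrix(c(146, 35, 18, 67), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
cm2 <- matrix(c(134, 35, 30, 67), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

metrics <- rbind(tree1 = class_metrics(cm1), tree2 = class_metrics(cm2))
print(round(metrics, 4))
```

Both trees identify the same 67 of 102 survivors; the second tree's lower accuracy comes entirely from 12 additional false positives (30 versus 18).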

Creating the random forest model

library(randomForest)

set.seed(123)

rf_model <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, 
                         data = train_set, importance = TRUE)
print(rf_model)

## 
## Call:
##  randomForest(formula = Survived ~ Pclass + Sex + Age + SibSp +      Parch + Fare + Embarked, data = train_set, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 15.52%
## Confusion matrix:
##     0   1 class.error
## 0 351  34  0.08831169
## 1  63 177  0.26250000

# Predictions on the test set
rf_predictions <- predict(rf_model, test_set)

# Confusion matrix 
rf_confusion_matrix <- table(test_set$Survived, rf_predictions)
rf_accuracy <- sum(diag(rf_confusion_matrix)) / sum(rf_confusion_matrix)

rf_confusion_matrix

##    rf_predictions
##       0   1
##   0 146  18
##   1  35  67

print(paste("Accuracy of random forest model:", round(rf_accuracy * 100, 2), "%"))

## [1] "Accuracy of random forest model: 80.08 %"

varImpPlot(rf_model)

Addressing Concerns

To address some of the concerns mentioned in the blog, I took a few important steps. First, I tackled the issue of overfitting by fitting a random forest model in addition to the single decision trees. Single decision trees often fit too closely to the training data, which can lead to poor performance on new data. In contrast, a random forest grows many decision trees on bootstrap samples and combines their votes, which reduces variance. The out-of-bag (OOB) error estimate was 15.52%, while accuracy on the held-out test set was 80.08% (an error rate of 19.92%), the same as the first decision tree. The random forest therefore did not outperform the best single tree on this split, but its OOB estimate suggests that it generalizes reasonably well to unseen data.
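The OOB estimate and the test-set accuracy are computed on different rows, so it helps to put them side by side. A quick check using only the counts printed above (the OOB matrix from print(rf_model) and the random forest's test-set matrix):

```r
# Out-of-bag confusion matrix counts from print(rf_model): training rows only.
oob_errors <- 34 + 63
oob_total  <- 351 + 34 + 63 + 177
oob_error_rate <- oob_errors / oob_total

# Test-set confusion matrix counts for the random forest.
test_errors <- 18 + 35
test_total  <- 146 + 18 + 35 + 67
test_error_rate <- test_errors / test_total

print(round(100 * c(oob = oob_error_rate, test = test_error_rate), 2))
```

This gives 15.52% out-of-bag and 19.92% on the test set, so the 15.52% figure should not be compared directly with the decision trees' test accuracies.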

Second, I paid attention to feature importance by setting importance = TRUE in the random forest model and plotting the result with varImpPlot. This allowed me to see which features played the most significant roles in predicting outcomes. Understanding these influential features is essential for refining the model and making more informed decisions on feature selection and engineering.

Finally, I was careful to prevent data leakage. This meant ensuring that no features included during training could directly or indirectly reveal the target outcome of whether a passenger survived. By carefully selecting features, I avoided biases that could give the model an unrealistically high accuracy. This step helped keep my model’s predictions more reliable and reflective of real-world performance.

Conclusion

The analysis revealed that the model’s performance depends significantly on the chosen features. The random forest model performed comparably to the best decision tree model but with added robustness against overfitting. The visual analysis provided insights into the data distribution, while the predictive models helped identify key survival indicators, such as Sex and Pclass. Future improvements could include fine-tuning hyperparameters, implementing cross-validation, and experimenting with additional feature engineering to enhance model accuracy and reliability.
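As one possible starting point for the cross-validation mentioned above, here is a minimal base-R sketch of k-fold cross-validated accuracy. The helper `cv_accuracy`, the synthetic data, and the logistic regression stand-in are all assumptions made for illustration; they are not part of the analysis in this report.

```r
# Hypothetical helper: k-fold cross-validated accuracy for any fit/predict pair.
cv_accuracy <- function(data, outcome, k, fit_fn, predict_fn) {
  # Randomly assign each row to one of k folds of near-equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    model <- fit_fn(data[folds != i, ])
    preds <- predict_fn(model, data[folds == i, ])
    mean(as.character(preds) == as.character(data[[outcome]][folds == i]))
  })
}

set.seed(123)
# Synthetic stand-in data (NOT the Titanic data): one noisy numeric predictor.
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- factor(as.integer(toy$x + rnorm(n) > 0))

# Logistic regression is used here only because it ships with base R.
fit_glm  <- function(d) glm(y ~ x, data = d, family = binomial)
pred_glm <- function(m, d) {
  factor(as.integer(predict(m, d, type = "response") > 0.5), levels = c(0, 1))
}

acc <- cv_accuracy(toy, "y", k = 5, fit_fn = fit_glm, predict_fn = pred_glm)
print(round(acc, 3))
print(mean(acc))
```

With the objects from this report, fit_fn could instead be function(d) rpart(Survived ~ Pclass + Sex + Age + Fare, data = d, method = "class") and predict_fn could be function(m, d) predict(m, d, type = "class"), giving a per-fold accuracy that is less dependent on a single 70/30 split.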

Data 622 HW 2

2024-10-31