1. Background & Objectives

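This analysis continues the Exploratory Data Analysis (EDA) of the Titanic dataset provided in train.csv. The goals of this assignment are to prepare the data, examine the quantitative variables with a pairs plot, present four additional plots, discuss the importance of multivariate data visualization, summarize the EDA findings, and document the missing data.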

2. Data Preparation

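library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(GGally)     # ggpairs() and wrap()
library(patchwork)  # | and / plot layout operators
library(scales)     # percent_format()

setwd("C:/DDS 8501 Titanic")
# Read the training data and recode the categorical columns as factors
titanic <- read_csv("train.csv") %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
    Pclass = factor(Pclass, ordered = TRUE),
    Sex = factor(Sex),
    Embarked = factor(Embarked)
  )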

glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, No,…
## $ Pclass      <ord> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

# Count missing values in each column
titanic %>% summarise(across(everything(), ~sum(is.na(.))))
## # A tibble: 1 × 12
##   PassengerId Survived Pclass  Name   Sex   Age SibSp Parch Ticket  Fare Cabin
##         <int>    <int>  <int> <int> <int> <int> <int> <int>  <int> <int> <int>
## 1           0        0      0     0     0   177     0     0      0     0   687
## # ℹ 1 more variable: Embarked <int>
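
The raw counts above are easier to compare when expressed as a share of rows. The following is a minimal base R sketch of that calculation; the `toy` data frame and its values are invented for illustration and stand in for the `titanic` tibble.

```r
# Toy stand-in for the `titanic` tibble; values are invented for illustration.
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA),
  Embarked = c("S", "C", "S", NA, "Q"),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

missing_summary <- data.frame(
  column   = names(na_count),
  n_miss   = as.integer(na_count),
  pct_miss = round(100 * as.numeric(na_share), 1)
)
print(missing_summary)
```

Applied to the real data, the same two calls (`colSums(is.na(titanic))` and `colMeans(is.na(titanic))`) should give the count and share of missing values for all 12 columns.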

3. Pairs Plot for Quantitative Variables

quant_vars <- titanic %>% select(Age, SibSp, Parch, Fare)
ggpairs(
  quant_vars,
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", alpha = 0.4, size = 0.7)),
  diag = list(continuous = "densityDiag")
) + ggtitle("Pairs Plot of Age, SibSp, Parch, and Fare")

Interpretation of Pairs Plot

The pairs plot reveals three major relationships:

  • Fare and Parch show a weak positive correlation (r ≈ 0.22), suggesting passengers who paid higher fares were somewhat more likely to travel with family. The strongest correlation in the plot is between SibSp and Parch (r ≈ 0.41).
  • Age is negatively correlated with SibSp and Parch, implying younger passengers were more likely to be accompanied by relatives.
  • No pair of variables is strongly correlated, so collinearity is unlikely to be a concern if all four variables are used in future modeling.
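
The correlations read off the pairs plot can be checked numerically. The sketch below assumes pairwise-complete Pearson correlations, so that rows with a missing Age are dropped only for pairs involving Age. The `toy_quant` data frame is invented for illustration and stands in for `quant_vars`.

```r
# Toy stand-in for quant_vars; values are invented for illustration.
toy_quant <- data.frame(
  Age   = c(22, 38, NA, 35, 4, 58),
  SibSp = c(1, 1, 0, 1, 3, 0),
  Parch = c(0, 0, 0, 0, 1, 0),
  Fare  = c(7.25, 71.28, 8.46, 53.10, 21.08, 26.55)
)

# Pairwise-complete Pearson correlations: each pair uses only the rows
# where both variables are observed.
r <- cor(toy_quant, use = "pairwise.complete.obs")
print(round(r, 2))

# Largest absolute off-diagonal correlation, as a quick collinearity screen.
max_abs_r <- max(abs(r[upper.tri(r)]))
print(max_abs_r)
```

With the real data, `cor(quant_vars, use = "pairwise.complete.obs")` should produce the matrix that the interpretation above relies on.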

4. Four Additional Plots

g1 <- ggplot(titanic, aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Sex", x = "Sex", y = "Proportion")

g2 <- ggplot(titanic, aes(Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Class", x = "Passenger Class", y = "Proportion")

g3 <- ggplot(titanic, aes(Survived, Age, fill = Survived)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Age Distribution by Survival", x = "Survived", y = "Age")

g4 <- ggplot(titanic, aes(Age, Fare, color = Survived)) +
  geom_jitter(alpha = 0.5, width = 0.5) +
  labs(title = "Fare vs. Age Colored by Survival", x = "Age", y = "Fare")

(g1 | g2) / (g3 | g4) + plot_annotation(title = "Multivariate and Bivariate Plots")

Interpretation of Additional Plots

  1. Survival by Sex: Female passengers had a survival rate over 70%, compared to ~19% for males, consistent with evacuation protocols that prioritized women.
  2. Survival by Class: First-class passengers had the highest survival rate (~63%), while third-class passengers had the lowest (~24%), reflecting class-based disparities in access.
  3. Age by Survival: Survivors tended to be somewhat younger. Many very young children survived, while older passengers saw greater mortality.
  4. Fare vs. Age Colored by Survival: Middle-aged passengers who paid higher fares had higher survival odds, suggesting that wealth and mobility mattered.
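
The survival rates quoted above are the row-wise proportions behind the filled bar charts. The following is a minimal base R sketch of how such rates can be computed. The `toy` data frame is invented for illustration, so its rates do not match the real dataset.

```r
# Toy stand-in for the `titanic` tibble; values are invented for illustration.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "No")
)

# Cross-tabulate, then convert each row to proportions (margin = 1),
# which gives the share of each outcome within each sex.
counts <- table(toy$Sex, toy$Survived)
rates  <- prop.table(counts, margin = 1)
print(round(rates, 2))

female_rate <- rates["female", "Yes"]
male_rate   <- rates["male", "Yes"]
```

Replacing `toy$Sex` with `titanic$Pclass` (and `toy$Survived` with `titanic$Survived`) would give the per-class rates in the same way.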

5. Importance of Multivariate Data Visualization

Multivariate data visualization enables the analyst to see more than two dimensions of data simultaneously. It helps uncover interaction effects, conditional trends, and complex relationships. For example, the scatterplot of Fare vs. Age colored by Survival highlights how both economic and demographic factors together affected survival probability. Such visual tools are essential to build intuition, detect data quality issues, and support theory generation.
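
A conditional trend of this kind can also be tabulated directly. The sketch below computes the survival rate within each combination of two grouping variables, which is the numeric counterpart of a plot that encodes a third variable. The `toy` data frame is invented for illustration and is not drawn from train.csv.

```r
# Toy stand-in for the `titanic` tibble; values are invented for illustration.
toy <- data.frame(
  Sex      = c("female", "female", "female", "female", "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 1, 1, 3, 3),
  Survived = c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No")
)

# Mean of a logical vector is the proportion of TRUE values, so this gives
# the survival rate in every Sex x Pclass cell.
rate_by_group <- tapply(
  toy$Survived == "Yes",
  list(Sex = toy$Sex, Pclass = toy$Pclass),
  mean
)
print(rate_by_group)
```

If the gap between rows changes from one column to the next, the two variables interact, which a single bivariate plot would not show.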

6. Summary of EDA Findings

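This analysis revealed several key insights:

  • Sex was strongly associated with survival: over 70% of female passengers survived, compared to ~19% of males.
  • Passenger class mattered: survival fell from ~63% in first class to ~24% in third class.
  • Survivors tended to be younger, and higher fares were associated with better survival odds.
  • Age and Cabin have substantial missingness (177 and 687 of 891 values, respectively), which must be addressed before modeling.

The EDA process involved transforming categorical variables, using color and layout strategically, and selecting plot types aligned with variable types. Together, these choices allowed for meaningful pattern recognition across the dataset.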

7. Missing Data Summary

titanic %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  knitr::kable(caption = "Count of Missing Values by Column")
Count of Missing Values by Column

PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
          0         0       0     0    0  177      0      0       0     0    687         2

Handling Plan:

  • Impute Age (177 of 891 values missing, about 20%) with the median
  • Impute Embarked (2 values missing) with the mode
  • Exclude Cabin, which is missing for 687 of 891 rows (about 77%)
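
The plan above could be carried out along the following lines. This is a minimal base R sketch under stated assumptions, not the final implementation: the `toy` data frame is invented for illustration, and `mode_level()` is a hypothetical helper name introduced here (base R has no built-in statistical mode function).

```r
# Toy stand-in for the `titanic` tibble; values are invented for illustration.
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA, 54),
  Embarked = factor(c("S", "C", "S", NA, "S", "Q")),
  Cabin    = c(NA, "C85", NA, NA, "E46", NA)
)

# Hypothetical helper: most frequent non-missing value of x
# (table() ignores NA by default; ties resolve to the first level).
mode_level <- function(x) names(which.max(table(x)))

imputed <- toy
# Replace missing ages with the median of the observed ages.
imputed$Age[is.na(imputed$Age)] <- median(toy$Age, na.rm = TRUE)
# Replace missing ports with the most common port.
imputed$Embarked[is.na(imputed$Embarked)] <- mode_level(toy$Embarked)
# Drop Cabin entirely.
imputed$Cabin <- NULL

print(imputed)
```

Median imputation keeps the center of the Age distribution but shrinks its spread, so any later analysis of Age variability should note that the imputed values are not observed ages.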