1. Background & Objectives

This analysis continues the Exploratory Data Analysis (EDA) of the Titanic dataset provided in train.csv. The goals of this assignment are to:

Generate and interpret a pairs plot of quantitative variables
Create four additional plots using combinations of categorical and quantitative variables
Discuss trends and insights in each visualization
Emphasize the importance of multivariate visualizations in EDA
Summarize key findings from the dataset
Create this work in an R Markdown document ready to knit to PDF/HTML

2. Data Preparation

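# Packages used throughout: tidyverse (read_csv, dplyr, ggplot2), GGally (ggpairs),
# scales (percent_format), patchwork (plot layout operators | and /)
library(tidyverse)
library(GGally)
library(scales)
library(patchwork)

setwd("C:/DDS 8501 Titanic")  # machine-specific project folder containing train.csv
titanic <- read_csv("train.csv") %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
    Pclass = factor(Pclass, ordered = TRUE),
    Sex = factor(Sex),
    Embarked = factor(Embarked)
  )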

glimpse(titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, No,…
## $ Pclass      <ord> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…

titanic %>% summarise(across(everything(), ~sum(is.na(.))))

## # A tibble: 1 × 12
##   PassengerId Survived Pclass  Name   Sex   Age SibSp Parch Ticket  Fare Cabin
##         <int>    <int>  <int> <int> <int> <int> <int> <int>  <int> <int> <int>
## 1           0        0      0     0     0   177     0     0      0     0   687
## # ℹ 1 more variable: Embarked <int>
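Raw counts are easier to compare once they are expressed as a share of all rows. With the counts above, Age is missing in 177 of 891 rows (about 20%) and Cabin in 687 of 891 (about 77%). The following is a minimal sketch of that calculation in base R. The data frame `toy` and its values are made up for illustration and are not taken from train.csv; replacing `toy` with `titanic` should give the real percentages.

```r
# Toy data standing in for the Titanic columns (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA),
  Embarked = c("S", "C", "S", NA, "Q"),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colMeans() gives the share of NAs per column
missing_share <- colMeans(is.na(toy))

# Convert to percentages and sort from most to least missing
missing_pct <- sort(round(100 * missing_share, 1), decreasing = TRUE)
print(missing_pct)
```

Sorting by percentage puts the columns that need a handling decision at the top of the output.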

3. Pairs Plot for Quantitative Variables

quant_vars <- titanic %>% select(Age, SibSp, Parch, Fare)
ggpairs(
  quant_vars,
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", alpha = 0.4, size = 0.7)),
  diag = list(continuous = "densityDiag")
) + ggtitle("Pairs Plot of Age, SibSp, Parch, and Fare")

Interpretation of Pairs Plot

The pairs plot reveals three major relationships:

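SibSp and Parch show the strongest relationship, a moderate positive correlation (r ≈ 0.41), meaning passengers traveling with siblings or a spouse often also traveled with parents or children. Fare and Parch are only weakly positively correlated (r ≈ 0.22), so higher fares were at most loosely tied to traveling with family.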
Age is negatively associated with SibSp and Parch, implying younger passengers were more likely to be accompanied by relatives.
No pair of variables is strongly correlated, so collinearity is unlikely to be a problem if all four variables are used in future modeling.
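The coefficients read off the pairs plot can be checked numerically with a correlation matrix. The sketch below uses base R on a small made-up data frame (`toy`, not the real data) and handles the missing Age value with `use = "pairwise.complete.obs"`, which is one reasonable choice rather than the only one. The `cutoff` of 0.8 is a hypothetical threshold for flagging strongly correlated pairs and does not come from this analysis.

```r
# Toy values standing in for Age, SibSp, Parch, and Fare (made up for illustration)
toy <- data.frame(
  Age   = c(22, 38, NA, 35, 2, 54),
  SibSp = c(1, 1, 0, 1, 3, 0),
  Parch = c(0, 0, 0, 0, 1, 0),
  Fare  = c(7.25, 71.28, 7.93, 53.10, 21.08, 51.86)
)

# Pearson correlations; each pair uses only the rows where both values are present
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# List any pair (upper triangle only) whose absolute correlation exceeds the cutoff
cutoff <- 0.8  # hypothetical threshold, chosen for illustration
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
print(high)
```

Running the same two calls on `quant_vars` would give the values to compare against the upper panels of the pairs plot.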

4. Four Additional Plots

# Stacked proportion bars: share of survivors within each sex
g1 <- ggplot(titanic, aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Sex", x = "Sex", y = "Proportion")

# Same layout for passenger class
g2 <- ggplot(titanic, aes(Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Class", x = "Passenger Class", y = "Proportion")

# Boxplots of age by survival status (rows with missing Age are dropped with a warning)
g3 <- ggplot(titanic, aes(Survived, Age, fill = Survived)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Age Distribution by Survival", x = "Survived", y = "Age")

# Fare against age, colored by survival; jitter reduces overplotting
g4 <- ggplot(titanic, aes(Age, Fare, color = Survived)) +
  geom_jitter(alpha = 0.5, width = 0.5) +
  labs(title = "Fare vs. Age Colored by Survival", x = "Age", y = "Fare")

# Arrange the four plots in a 2 x 2 grid with patchwork
(g1 | g2) / (g3 | g4) + plot_annotation(title = "Multivariate and Bivariate Plots")

Interpretation of Additional Plots

Survival by Sex: Female passengers had a survival rate over 70%, compared to ~19% for males, a gap consistent with evacuation protocols that prioritized women and children.
Survival by Class: First-class passengers had the highest survival rate (~63%), while third-class passengers had the lowest (~24%), reflecting disparities in access.
Age by Survival: Survivors tended to be younger, with many very young children surviving while older passengers saw greater mortality.
Fare vs. Age Colored by Survival: Passengers who paid higher fares and were middle-aged had higher survival odds, suggesting that wealth and mobility mattered.
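The survival rates quoted for the bar charts are row proportions of a two-way table. The following sketch shows the calculation in base R on a made-up data frame (`toy`, eight invented passengers), so the printed rates are illustrative only and will not match the Titanic figures.

```r
# Toy passengers (made up for illustration)
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes")
)

# Cross-tabulate counts, then normalize within each row (margin = 1 means "per Sex")
counts <- table(toy$Sex, toy$Survived)
rates  <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Applied to the real data, `table(titanic$Sex, titanic$Survived)` or `table(titanic$Pclass, titanic$Survived)` would take the place of `counts`.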

5. Importance of Multivariate Data Visualization

Multivariate data visualization enables the analyst to see more than two dimensions of data simultaneously. It helps uncover interaction effects, conditional trends, and complex relationships. For example, the scatterplot of Fare vs. Age colored by Survival highlights how both economic and demographic factors together affected survival probability. Such visual tools are essential to build intuition, detect data quality issues, and support theory generation.
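An interaction effect can also be inspected numerically by computing the survival rate within each combination of two categorical variables. The sketch below does this for Sex and Pclass in base R. The data frame `toy` is invented for illustration, and Survived is coded 0/1 here as an assumption. With the factor used in this report, `as.integer(titanic$Survived == "Yes")` would produce the same coding.

```r
# Toy passengers (made up for illustration); Survived coded as 0 = No, 1 = Yes
toy <- data.frame(
  Sex      = c("female", "female", "female", "female", "male", "male", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 1, 1, 3, 3),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# The mean of a 0/1 indicator is a rate; tapply() computes it per Sex x Pclass cell
rate_by_cell <- with(toy, tapply(Survived, list(Sex = Sex, Pclass = Pclass), mean))
print(rate_by_cell)
```

If the gap between the sexes differs from one class column to the next, the two variables interact. A one-variable summary would hide that pattern.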

6. Summary of EDA Findings

This analysis revealed several key insights:

Sex and Class are dominant factors affecting survival
Age and Fare contribute quantitative nuance and reveal important interaction patterns
Multivariate visualization clarified overlapping effects that bivariate analysis may obscure
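Missingness is concentrated in two columns: Cabin (687 of 891 values missing, about 77%), which can be excluded, and Age (177 missing, about 20%), which needs imputation; Embarked is missing only 2 values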

The EDA process involved transforming categorical variables, using color and layout strategically, and selecting plot types aligned with variable types. Together, these choices allowed for meaningful pattern recognition across the dataset.

7. Missing Data Summary

titanic %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  knitr::kable(caption = "Count of Missing Values by Column")

Count of Missing Values by Column
PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
          0         0       0     0    0  177      0      0       0     0    687         2

Handling Plan:

Impute Age with the median
Impute Embarked with the mode
Exclude Cabin, which is about 77% missing (687 of 891 values)
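The three steps of the handling plan can be sketched in base R as follows. This is a minimal illustration on a made-up data frame (`toy`), not the code used for this report. The helper names `age_median` and `embarked_mode` are introduced here for readability.

```r
# Toy data with the same three problem columns (values made up for illustration)
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA, 54),
  Embarked = factor(c("S", "C", "S", NA, "Q", "S")),
  Cabin    = c(NA, "C85", NA, NA, NA, "E46"),
  stringsAsFactors = FALSE
)

# 1. Impute Age with the median of the observed values
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age[is.na(toy$Age)] <- age_median

# 2. Impute Embarked with the mode; table() ignores NA by default,
#    and which.max() returns the first level if there is a tie
embarked_mode <- names(which.max(table(toy$Embarked)))
toy$Embarked[is.na(toy$Embarked)] <- embarked_mode

# 3. Exclude Cabin
toy$Cabin <- NULL

print(toy)
```

Median imputation leaves the center of the Age distribution unchanged but shrinks its spread, so the imputed values should be kept in mind when interpreting later Age plots.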

K.Campise85014Knit

Kat Campise

May 05, 2025