This analysis continues the Exploratory Data Analysis (EDA) of the
Titanic dataset provided in train.csv. The goals of this
assignment are to:
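# Packages used by the code in this report
library(tidyverse)  # read_csv(), mutate(), glimpse(), ggplot()
library(GGally)     # ggpairs(), wrap()
library(scales)     # percent_format()
library(patchwork)  # |, / and plot_annotation() for combining plots

setwd("C:/DDS 8501 Titanic")

# Read the data and convert the categorical columns to factors
titanic <- read_csv("train.csv") %>%
  mutate(
    Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")),
    Pclass = factor(Pclass, ordered = TRUE),
    Sex = factor(Sex),
    Embarked = factor(Embarked)
  )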
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, No,…
## $ Pclass <ord> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <fct> male, female, female, female, male, male, male, male, fema…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
titanic %>% summarise(across(everything(), ~sum(is.na(.))))
## # A tibble: 1 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 177 0 0 0 0 687
## # ℹ 1 more variable: Embarked <int>
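Raw counts are easier to judge as a share of all rows. The following is a minimal sketch, assuming a small made-up tibble in place of `titanic`; the names `toy` and `missing_summary` are hypothetical and do not appear in the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `titanic`; not real passenger records.
toy <- tibble(
  Age      = c(22, NA, 26, 35, NA),
  Cabin    = c(NA, "C85", NA, NA, NA),
  Embarked = c("S", "C", "S", NA, "Q")
)

missing_summary <- toy %>%
  # Count NA values in every column (same idea as the summarise() call above)
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # Reshape to one row per variable
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  # Express each count as a percentage of all rows
  mutate(pct_missing = round(100 * n_missing / nrow(toy), 1)) %>%
  arrange(desc(n_missing))

missing_summary
```

Applied to the counts reported above, 177 of 891 Age values (about 19.9%) and 687 of 891 Cabin values (about 77.1%) are missing.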
quant_vars <- titanic %>% select(Age, SibSp, Parch, Fare)
ggpairs(
  quant_vars,
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(continuous = wrap("points", alpha = 0.4, size = 0.7)),
  diag = list(continuous = "densityDiag")
) + ggtitle("Pairs Plot of Age, SibSp, Parch, and Fare")
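The correlations drawn in the upper panels can also be obtained as a plain matrix, which is convenient for reporting exact values. This is a rough sketch using base R's `cor()` on made-up numbers; `toy_quant` and `cor_matrix` are hypothetical names, and the pairwise handling of missing values is an assumption that should give results close to, but not necessarily identical with, what the plot displays.

```r
# Hypothetical toy values standing in for the four quantitative columns.
toy_quant <- data.frame(
  Age   = c(22, 38, NA, 35, 54, 2),
  SibSp = c(1, 1, 0, 1, 0, 3),
  Parch = c(0, 0, 0, 0, 0, 1),
  Fare  = c(7.25, 71.28, 7.93, 53.10, 51.86, 21.08)
)

# Pairwise-complete Pearson correlations: each pair of columns uses every row
# in which both values are present, so a missing Age does not remove that row
# from the SibSp, Parch, and Fare comparisons.
cor_matrix <- cor(toy_quant, use = "pairwise.complete.obs")

round(cor_matrix, 2)
```

On the real data, the same call on `quant_vars` would return a 4 × 4 matrix with one entry per panel of the pairs plot.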
The pairs plot reveals three major relationships:
g1 <- ggplot(titanic, aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Sex", x = "Sex", y = "Proportion")
g2 <- ggplot(titanic, aes(Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Survival Proportion by Class", x = "Passenger Class", y = "Proportion")
g3 <- ggplot(titanic, aes(Survived, Age, fill = Survived)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Age Distribution by Survival", x = "Survived", y = "Age")
g4 <- ggplot(titanic, aes(Age, Fare, color = Survived)) +
  geom_jitter(alpha = 0.5, width = 0.5) +
  labs(title = "Fare vs. Age Colored by Survival", x = "Age", y = "Fare")
(g1 | g2) / (g3 | g4) + plot_annotation(title = "Multivariate and Bivariate Plots")
Multivariate data visualization lets the analyst examine more than two variables at once, which helps uncover interaction effects, conditional trends, and other complex relationships. For example, the scatterplot of Fare vs. Age colored by Survived shows how an economic factor (fare paid) and a demographic factor (age) jointly relate to survival. Such visual tools help build intuition, detect data quality issues, and generate hypotheses.
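A conditional trend of this kind can also be checked numerically by computing the survival rate within each combination of two grouping variables. The sketch below assumes a handful of made-up records; `toy` and `survival_by_group` are hypothetical names, and the values are illustrative only, not results from train.csv.

```r
library(dplyr)

# Hypothetical toy records; not actual Titanic passengers.
toy <- tibble(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female"),
  Pclass   = c(1, 1, 3, 1, 3, 3, 3, 3),
  Survived = c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes")
)

survival_by_group <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(
    n = n(),                                  # passengers in the group
    survival_rate = mean(Survived == "Yes"),  # share of the group that survived
    .groups = "drop"
  )

survival_by_group
```

If the survival rate differs across classes by a different amount for each sex, that points to an interaction between the two variables, which a single bivariate plot would not show.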
This analysis revealed several key insights:
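The EDA process involved converting the categorical variables (Survived, Pclass, Sex, and Embarked) to factors, mapping Survived to fill and color, arranging the four plots in a 2 × 2 grid, and matching plot types to variable types: proportion bars for two categorical variables, boxplots for a numeric variable split by group, and a scatterplot for two numeric variables. Together, these choices made patterns across the dataset easier to recognize.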
titanic %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  knitr::kable(caption = "Count of Missing Values by Column")
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |