QBS181_Group_Project

Question: Before applying statistical tests or predictive models, we wanted to know — are any numeric features strongly correlated with each other? For example, could “family size” simply duplicate the information already contained in “SibSp” (number of siblings/spouses) and “Parch” (number of parents/children)?

Interpretation: The correlation heatmap helps us visually inspect relationships among continuous variables such as Age, Fare, SibSp, Parch, and FamilySize. From the plot, we can see that FamilySize is highly correlated with SibSp (r ≈ 0.86) and Parch (r ≈ 0.79) — which makes sense, since FamilySize is derived from these two features. Other variables, such as Age and Fare, show only weak correlations with the rest. This suggests that multicollinearity is mainly present among the family-related variables, while other numeric variables are largely independent. Understanding these correlations allows us to decide which variables to keep or combine for downstream statistical or machine-learning analyses.

# Load required packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.5.2

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

# Read the Titanic dataset
titanic <- read.csv("Total_Cleaned.csv")

# Add FamilySize variable (SibSp + Parch + 1 = total number of people in the family)
titanic <- titanic %>%
  mutate(FamilySize = SibSp + Parch + 1)

# Select numeric variables for correlation analysis
num_vars <- titanic %>%
  select(Age, Fare, SibSp, Parch, FamilySize) %>%
  mutate_all(as.numeric)

# Compute correlation matrix
cor_matrix <- cor(num_vars, use = "complete.obs")
cor_melt <- melt(cor_matrix)

# Plot correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
  scale_fill_viridis_c(option = "A") +
  labs(
    title = "Correlation Heatmap of Numeric Features",
    subtitle = "Inspecting multicollinearity among Age, Fare, SibSp, Parch, and FamilySize",
    x = "",
    y = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(face = "italic", size = 12)
  )

# Use Titanic dataset
df <- titanic

# Compute mean age per sex
mu <- df %>%
  group_by(Sex) %>%
  summarise(grp.mean = mean(Age, na.rm = TRUE))

# Custom labeller for facet titles (uppercase)
capitalize_labeller <- as_labeller(function(value) toupper(value))

# Calculate scaling factor for density overlay
age_range <- diff(range(df$Age, na.rm = TRUE))
n_obs <- nrow(df)
n_bins <- 30
scale_factor <- n_obs * age_range / n_bins

# Create histogram + density + mean line + facet by sex
ggplot(df, aes(x = Age, color = Sex, fill = Sex)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = n_bins) +
  geom_density(aes(y = ..density.. * scale_factor), alpha = 0.6, inherit.aes = TRUE) +
  geom_vline(data = mu, aes(xintercept = grp.mean, color = Sex), linetype = "dashed") +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  facet_wrap(~ Sex, ncol = 2, labeller = capitalize_labeller) +
  labs(
    title = "Age Distribution by Sex",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_linedraw() +
  theme(
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5),
    strip.text = element_text(face = "bold", size = 10)
  )

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Load libraries
library(tidyverse)
library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.5.2

# Read cleaned Titanic data
titanic <- read.csv("Total_Cleaned.csv")

# Data preparation
titanic <- titanic %>%
  mutate(
    Survived = factor(Survived, labels = c("No", "Yes")),
    Sex = factor(Sex),
    AgeGroup = case_when(
      Age < 18 ~ "Child",
      Age >= 18 & Age < 60 ~ "Adult",
      Age >= 60 ~ "Senior"
    )
  ) %>%
  drop_na(Age, Sex, Survived)

# Calculate survival percentage by group
survival_summary <- titanic %>%
  group_by(AgeGroup, Sex) %>%
  summarise(SurvivalRate = mean(Survived == "Yes") * 100)

## `summarise()` has grouped output by 'AgeGroup'. You can override using the
## `.groups` argument.

# Perform Fisher's exact test for each AgeGroup
fisher_results <- titanic %>%
  group_by(AgeGroup) %>%
  summarise(
    p.value = fisher.test(table(Sex, Survived))$p.value
  ) %>%
  mutate(p.adj = p.adjust(p.value, method = "fdr"))

fisher_results

## # A tibble: 3 × 3
##   AgeGroup   p.value     p.adj
##   <chr>        <dbl>     <dbl>
## 1 Adult    8.16e-131 2.45e-130
## 2 Child    1.59e-  9 2.39e-  9
## 3 Senior   1.57e-  7 1.57e-  7

# Merge for plotting
plot_data <- survival_summary

# Plot
# Plot
ggplot(plot_data, aes(x = AgeGroup, y = SurvivalRate, color = Sex)) +
  geom_point(size = 4) +
  geom_segment(aes(xend = AgeGroup, y = 0, yend = SurvivalRate, color = Sex), size = 1) +
  geom_text(aes(label = round(SurvivalRate, 1)), vjust = -0.5, size = 4, fontface = "bold") +
  scale_color_manual(values = c("darkorange2", "seagreen3")) +
  scale_y_continuous(limits = c(0, 120), expand = c(0, 0)) +
  labs(
    title = "Survival Rate by Age Group and Sex",
    subtitle = "Fisher’s Exact Test (FDR-adjusted)",
    x = "Age Group",
    y = "Survival (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "right"
  ) +
  geom_signif(
    data = data.frame(
      xmin = c("Child", "Adult", "Child"),
      xmax = c("Adult", "Senior", "Senior"),
      annotations = c("***", "***", "***"),
      y_position = c(110, 115, 120)
    ),
    aes(xmin = xmin, xmax = xmax, annotations = annotations, y_position = y_position),
    inherit.aes = FALSE,
    manual = TRUE,
    tip_length = 0.02,
    textsize = 5,
    vjust = 0.3
  )

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning in geom_signif(data = data.frame(xmin = c("Child", "Adult", "Child"), :
## Ignoring unknown aesthetics: xmin, xmax, annotations, and y_position

library(tidyverse)
library(ggpubr)

# Read Titanic dataset
titanic <- read.csv("Total_Cleaned.csv")

# Data preparation
titanic <- titanic %>%
  mutate(
    Survived = factor(Survived, labels = c("Died", "Survived"))
  )

# Wilcoxon rank-sum test
wilcox_test <- wilcox.test(Fare ~ Survived, data = titanic)
p_value <- wilcox_test$p.value
significance <- ifelse(p_value < 0.0001, "****",
                  ifelse(p_value < 0.001, "***",
                  ifelse(p_value < 0.01, "**",
                  ifelse(p_value < 0.05, "*", "ns"))))

# Plot
ggplot(titanic, aes(x = Survived, y = Fare, fill = Survived)) +
  geom_violin(trim = FALSE, alpha = 0.4, color = NA) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.6) +
  scale_fill_manual(values = c("#E74C3C", "#27AE60")) +
  labs(
    title = "Fare Distribution by Survival Status",
    subtitle = "Wilcoxon rank-sum test: Survivors paid significantly higher fares",
    x = NULL,
    y = "Fare (USD)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, color = "gray40"),
    legend.position = "right",
    legend.title = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  ) +
  geom_signif(
    comparisons = list(c("Died", "Survived")),
    annotations = significance,
    y_position = max(titanic$Fare, na.rm = TRUE) * 0.9,
    tip_length = 0.02,
    textsize = 5
  )

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_ydensity()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_signif()`).

library(tidyverse)
library(ggforce)

## Warning: package 'ggforce' was built under R version 4.5.2

# Read Titanic data
titanic <- read.csv("Total_Cleaned.csv")

# Prepare data
titanic <- titanic %>%
  mutate(
    Survived = factor(Survived, labels = c("Died", "Survived")),
    Pclass = factor(Pclass),
    Sex = factor(Sex)
  )

# Summarize counts
titanic_pset <- titanic %>%
  count(Sex, Pclass, Survived)

# Convert to ggforce compatible long format
titanic_long <- titanic_pset %>%
  gather_set_data(1:3)

# Plot using geom_parallel_sets
ggplot(titanic_long, aes(x = x, id = id, split = y, value = n)) +
  geom_parallel_sets(aes(fill = Survived), alpha = 0.7, axis.width = 0.25) +
  geom_parallel_sets_axes(axis.width = 0.25, fill = "grey70") +
  geom_parallel_sets_labels(size = 4, angle = 0, colour = "black") +
  scale_fill_manual(values = c("#E74C3C", "#27AE60")) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Passenger Flow: Sex → Pclass → Survival",
    y = "Count",
    fill = "Survival Status"
  ) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.title = element_text(face = "bold")
  )

library(tidyverse)
library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# 1. Load data
titanic <- read.csv("Total_Cleaned.csv")

# 2. Data prep
titanic <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = factor(Pclass, labels = c("1st", "2nd", "3rd"))
  )

# 3. Fit a logistic regression model
model <- glm(Survived ~ Age * Fare * Pclass, data = titanic, family = binomial)

# 4. Generate prediction grid
age_seq <- seq(0, 80, by = 1)
fare_seq <- seq(0, 200, by = 2)
grid <- expand.grid(Age = age_seq, Fare = fare_seq, Pclass = levels(titanic$Pclass))

grid$predicted <- predict(model, newdata = grid, type = "response")

# 5. Plotly 3D interactive surface
fig <- plot_ly(
  grid,
  x = ~Age,
  y = ~Fare,
  z = ~predicted,
  color = ~Pclass,
  colors = c("#1f77b4", "#2ca02c", "#d62728"),
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 2)
)

fig <- fig %>%
  layout(
    title = "3D Survival Probability Surface by Age, Fare, and Class",
    scene = list(
      xaxis = list(title = "Age (years)"),
      yaxis = list(title = "Fare (USD)"),
      zaxis = list(title = "Predicted Survival Probability")
    )
  )

fig

Question: How do age and ticket fare jointly affect survival probability across different passenger classes on the Titanic?

Explanation: This figure visualizes the modeled survival probability surface derived from a logistic regression including interaction terms between Age, Fare, and Passenger Class. Each panel represents a passenger class (1st, 2nd, 3rd), with color gradients indicating the predicted likelihood of survival.

Findings:

In 1st Class, survival probability remains consistently high (yellow area) across nearly all ages and fares, suggesting strong protection regardless of demographic factors.

2nd Class exhibits a moderate decline in survival among older passengers with lower fares, revealing partial socioeconomic vulnerability.

In 3rd Class, the contrast is striking: survival probability drops sharply with both age and low fare, forming a deep “survival valley” (dark purple). Only younger or higher-paying passengers show improved chances.

Implication: The heatmap reveals that socioeconomic status dramatically shaped survival outcomes. Higher ticket fares and class membership mitigated the negative effects of age, while lower-class passengers—especially older and poorer individuals—faced disproportionately low survival odds. This pattern underscores the intersection of age, wealth, and privilege in determining life-or-death outcomes during the Titanic disaster.

library(tidyverse)

# 1. Load data
titanic <- read.csv("Total_Cleaned.csv")

titanic <- titanic %>%
  mutate(
    Survived = as.numeric(Survived),
    Pclass = factor(Pclass, labels = c("1st", "2nd", "3rd"))
  )

# 2. Fit logistic regression model
model <- glm(Survived ~ Age * Fare * Pclass, data = titanic, family = binomial)

# 3. Create prediction grid
age_seq <- seq(0, 80, by = 1)
fare_seq <- seq(0, 200, by = 2)
grid <- expand.grid(Age = age_seq, Fare = fare_seq, Pclass = levels(titanic$Pclass))

# Predict survival probability
grid$SurvivalProb <- predict(model, newdata = grid, type = "response")

# 4. Visualization
ggplot(grid, aes(x = Age, y = Fare, fill = SurvivalProb)) +
  geom_tile(alpha = 0.9) +
  geom_contour(aes(z = SurvivalProb), color = "white", linewidth = 0.5, bins = 10) +
  scale_fill_viridis_c(option = "C", name = "Survival Probability", limits = c(0, 1)) +
  facet_wrap(~Pclass, ncol = 3) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Survival Probability Landscape by Age, Fare, and Class",
    subtitle = "Modeled from logistic regression (Age × Fare × Pclass)",
    x = "Age (years)",
    y = "Fare (USD)"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, face = "italic"),
    strip.text = element_text(face = "bold", size = 13),
    legend.position = "right"
  )

## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Question: Did the number of family members traveling together affect a passenger’s likelihood of survival on the Titanic?

Explanation: This visualization examines how family size—the total number of family members aboard (including the passenger)—relates to survival probability. The LOESS curve highlights a clear nonlinear pattern: survival first increases with family size, peaks at small to medium families, and then declines sharply for larger groups.

Findings:

Single travelers had relatively low survival rates, possibly because they lacked social support during the evacuation.

Small families (2–4 members) showed the highest survival probability, suggesting that mutual assistance and coordination helped them secure places in lifeboats.

Large families (5 or more) faced rapidly decreasing survival chances, likely due to difficulties staying together and finding sufficient lifeboat capacity.

Implication: The inverted-U relationship between family size and survival reveals that social connectedness provided both advantages and constraints. Passengers traveling with a few family members benefited from cooperation, while those in large groups may have been hindered by emotional bonds and logistical challenges. This highlights how even on a sinking ship, human social structure played a decisive role in determining life and death outcomes.

library(tidyverse)

# 1. Load data
titanic <- read.csv("Total_Cleaned.csv")

# 2. Derive FamilySize
titanic <- titanic %>%
  mutate(
    FamilySize = SibSp + Parch + 1,
    Survived = as.numeric(Survived)
  )

# 3. Summarize survival rate by family size
family_summary <- titanic %>%
  group_by(FamilySize) %>%
  summarise(
    SurvivalRate = mean(Survived, na.rm = TRUE),
    Count = n()
  )

# 4. Visualization
ggplot(family_summary, aes(x = FamilySize, y = SurvivalRate)) +
  geom_point(aes(size = Count, color = SurvivalRate), alpha = 0.8) +
  geom_smooth(
    method = "loess", se = TRUE, color = "#0073C2FF",
    fill = "#0073C2FF", alpha = 0.25
  ) +
  scale_color_viridis_c(option = "C", limits = c(0, 1)) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Effect of Family Size on Survival Probability",
    subtitle = "Nonlinear relationship with highest survival among mid-sized families",
    x = "Family Size (including self)",
    y = "Mean Survival Probability",
    color = "Survival Rate",
    size = "Number of Passengers"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, face = "italic"),
    legend.position = "right"
  )

## `geom_smooth()` using formula = 'y ~ x'