The Chi-Squared Test of Independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our actual data to the frequencies we would expect to see if the two variables were completely independent of each other.
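
In symbols, this comparison is usually written as follows. Here \(O_{ij}\) is the observed count in row \(i\) and column \(j\), and \(E_{ij}\) is the count expected under independence. The symbols \(R_i\), \(C_j\), and \(N\) (row total, column total, and grand total) are notation introduced here for illustration.

\[E_{ij} = \frac{R_i \times C_j}{N}\]
\[X^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

The further the observed counts sit from the expected ones, the larger \(X^2\) becomes.
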
We’ll use the historic Titanic dataset to investigate a crucial sociological question:
Was survival on the Titanic associated with passenger class?
In other words, did your ticket status (1st, 2nd, or 3rd Class) structurally affect your chances of surviving the disaster, or was survival purely random across all groups?
We begin by loading the tidyverse ecosystem for data manipulation and visualization, along with gt and gtExtras to render publication-quality tables.
# Load Core Libraries
library(tidyverse)
library(gt)
library(gtExtras)
The Titanic dataset ships with base R (in the datasets package). It contains cross-classified counts of passengers and crew members. To maintain our focus on social stratification, we will filter out the crew and expand the compressed table into individual rows, one per person.
# Clean, expand, and filter the dataset
titanic_df <- Titanic |>
  as_tibble() |>
  uncount(n) |>
  filter(Class != "Crew")
glimpse(titanic_df)
Rows: 1,316
Columns: 4
$ Class <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
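
To make the expansion step concrete, here is a minimal sketch of what `uncount()` does on a toy table. The `toy` and `expanded` objects are hypothetical names used only for illustration. The last line is a base R cross-check that the full dataset minus the crew leaves the 1,316 rows reported above.

```r
library(tibble)
library(tidyr)

# A compressed table: one row per group, with a count column `n`
toy <- tibble(group = c("a", "b"), n = c(2, 3))

# uncount() repeats each row `n` times and drops the count column
expanded <- uncount(toy, n)
expanded$group
# "a" "a" "b" "b" "b"

# Cross-check on the real data: everyone aboard minus the crew
passengers <- sum(Titanic) - sum(Titanic["Crew", , , ])
passengers
# 1316
```
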
Before diving into the math, it is vital to visualize the raw distributions. We will build a horizontal stacked bar chart scaled to proportions, so that each bar shows the share of survivors and deaths within one ticket class.
# Calculate internal metrics for positioning labels
titanic_summary <- titanic_df |>
  count(Class, Survived) |>
  group_by(Class) |>
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)  # midpoint of each bar segment
  ) |>
  ungroup()
# Generate the plot
titanic_summary |>
  ggplot(aes(x = prop, y = fct_rev(Class), fill = fct_rev(Survived))) +
  geom_col(width = 0.6) +
  # Counts inside each segment
  geom_text(aes(x = label_pos, label = n), colour = "white", fontface = "bold", size = 5) +
  # Class totals to the right of each bar (drawn once per class)
  geom_text(aes(x = 1.08, label = ifelse(Survived == "Yes", paste("Total:", total), "")),
            hjust = 0, colour = "grey40", size = 4) +
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15)), labels = NULL) +
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL, y = NULL, fill = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )
# Build the summary table of survival by class
titanic_df |>
  count(Class, Survived) |>
  pivot_wider(names_from = Survived, values_from = n) |>
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>
  rename(Died = No, Survived = Yes) |>
  gt() |>
  fmt_percent(`Survival Rate`, decimals = 1) |>
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>
  cols_align(align = "center", columns = -Class) |>
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear socio-economic gradient from 1st Class to 3rd Class"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )
Titanic Survival by Class
A clear socio-economic gradient from 1st Class to 3rd Class
| Class | Died | Survived | Total | Survival Rate |
|---|---|---|---|---|
| 1st | 122 | 203 | 325 | 62.5% |
| 2nd | 167 | 118 | 285 | 41.4% |
| 3rd | 528 | 178 | 706 | 25.2% |
The visualization and the summary table both reveal a stark disparity: the survival rate fell from 62.5% in 1st Class to 41.4% in 2nd Class and 25.2% in 3rd Class.
While the raw disparity is visually obvious, we still need to assess whether a gap this large could plausibly arise by chance if class and survival were independent. We cross-tabulate the data and subject it to Pearson’s Chi-squared test.
# Conduct the Test of Independence
titanic_df |>
  select(Class, Survived) |>
  table() |>
  chisq.test()
Pearson's Chi-squared test
data: table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
The test output yields three values needed for formal reporting:
Chi-squared statistic (\(X^2\)): 133.052. This summarizes how far the observed counts deviate from the counts expected under independence.
Degrees of freedom (df): 2.
\[\text{df} = (\text{Rows} - 1) \times (\text{Columns} - 1)\]
\[\text{df} = (3 - 1) \times (2 - 1) = 2\]
This determines the shape of the Chi-Squared distribution curve used to assess significance.
P-value: < 2.2e-16 (extremely close to 0).
There is a highly significant association between passenger ticket class and survival on the Titanic. A test of independence cannot establish causation on its own, but the pattern is consistent with ticket class shaping a passenger’s likelihood of making it to a lifeboat. Tragically, those in the lower classes faced vastly diminished odds of survival.
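
As a check on the reported values, the statistic, degrees of freedom, and p-value can be reproduced by hand in base R from the counts in the summary table. This is a minimal sketch, not the internals of `chisq.test()`. The object names are illustrative, and the 5% significance level used for the critical value is an assumption that the analysis above does not state.

```r
# Observed counts, typed in from the summary table above
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Outcome = c("Died", "Survived"))
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum over cells of (O - E)^2 / E
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Critical value at an assumed 5% significance level
critical_value <- qchisq(0.95, df = df)

round(expected, 1)
round(x_squared, 3)       # 133.052
df                        # 2
p_value < 2.2e-16         # TRUE
round(critical_value, 2)  # 5.99
```

The statistic far exceeds the critical value, which is why the p-value is reported only as an upper bound.
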
| Metric | High Value | Low Value |
|---|---|---|
| Chi-Square (\(X^2\)) | Observed counts deviate strongly from those expected under independence | Observed counts are close to those expected under independence |
| P-Value | Not statistically significant (consistent with chance) | Statistically significant (unlikely under independence) |