The chi-squared test of independence determines whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our data to the frequencies we would expect if the two variables were completely independent of each other.
If the observed counts differ substantially from the expected counts, we have evidence that the variables are associated.
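In symbols, the comparison is usually written as follows. The notation is introduced here for illustration only: $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $E_{ij}$ is the count expected under independence, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

Each cell contributes its squared deviation scaled by the expected count, so a large gap in a cell with a small expected count weighs more than the same gap in a cell with a large one.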
We’ll use the famous Titanic dataset to ask: Was survival on the Titanic associated with passenger class?
In other words, did your ticket class (1st, 2nd, or 3rd) affect your chances of surviving the disaster?
library(tidyverse)
library(gt)
library(gtExtras)

The Titanic dataset is built into R. It contains counts of passengers and crew aboard the RMS Titanic, cross-classified by class, sex, age, and survival status. We’ll focus on passengers only.
# Expand the built-in contingency table to one row per person, then drop the crew
titanic_df <- Titanic |>
  as_tibble() |>
  uncount(n) |>
  filter(Class != "Crew")

glimpse(titanic_df)

Rows: 1,316
Columns: 4
$ Class <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
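Since the built-in object is a four-way table of counts rather than a data frame, it can help to look at its structure directly. The following is a minimal base-R sketch, not part of the original workflow; the names `class_by_survival` and `passenger_counts` are made up for illustration.

```r
# Titanic ships with base R (datasets package) as a 4-dimensional table of counts
dim(Titanic)
dimnames(Titanic)

# Collapse over Sex and Age, keeping Class (dimension 1) and Survived (dimension 4)
class_by_survival <- margin.table(Titanic, margin = c(1, 4))

# Drop the crew row so that only passengers remain
passenger_counts <- class_by_survival[rownames(class_by_survival) != "Crew", ]

passenger_counts
sum(passenger_counts)
```

The total of the passenger counts should match the 1,316 rows reported by `glimpse()` above, which is a quick check that expanding the table with `uncount()` and filtering out the crew did what was intended.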
Before running the test, let’s visualise the relationship between class and survival.
# Per-class counts and proportions, plus the midpoint of each bar segment for the labels
titanic_summary <- titanic_df |>
  count(Class, Survived) |>
  group_by(Class) |>
  mutate(
    total = sum(n),
    prop = n / total,
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>
  ungroup()

titanic_summary |>
  ggplot(aes(x = prop,
             y = fct_rev(Class),
             fill = fct_rev(Survived))) +
  geom_col(width = 0.6) +
  geom_text(aes(x = label_pos, label = n),
            colour = "white",
            fontface = "bold",
            size = 5) +
  geom_text(aes(x = 1.08,
                label = ifelse(Survived == "Yes",
                               paste("Total:", total),
                               "")),
            hjust = 0,
            colour = "grey40",
            size = 4) +
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.15)),
    labels = NULL
  ) +
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL,
    y = NULL,
    fill = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

titanic_df |>
  count(Class, Survived) |>
  pivot_wider(names_from = Survived,
              values_from = n) |>
  mutate(
    Total = No + Yes,
    `Survival Rate` = Yes / Total
  ) |>
  rename(Died = No, Survived = Yes) |>
  gt() |>
  fmt_percent(`Survival Rate`, decimals = 1) |>
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>
  cols_align(align = "center",
             columns = -Class) |>
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear gradient from 1st Class to 3rd Class"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )

Titanic Survival by Class
A clear gradient from 1st Class to 3rd Class

| Class | Died | Survived | Total | Survival Rate |
|---|---|---|---|---|
| 1st | 122 | 203 | 325 | 62.5% |
| 2nd | 167 | 118 | 285 | 41.4% |
| 3rd | 528 | 178 | 706 | 25.2% |

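The visualisation and table both reveal a clear pattern: the survival rate fell from 62.5% in 1st class to 41.4% in 2nd class and 25.2% in 3rd class. The chi-squared test tells us whether a gradient this steep could plausibly arise by chance if class and survival were independent.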
titanic_df |>
  select(Class, Survived) |>
  table() |>
  chisq.test()

Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16

The output tells us three key things:
X-squared: The test statistic. Larger values indicate greater deviation from what we’d expect if the variables were independent.
df (degrees of freedom): Calculated as (rows - 1) × (columns - 1) = (3 - 1) × (2 - 1) = 2.
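p-value: Extremely small (< 2.2e-16), far below the conventional threshold of 0.05. If class and survival were truly independent, a test statistic this large would be extraordinarily unlikely.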
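To see where these three numbers come from, the calculation can be reproduced by hand. This is a minimal sketch rather than the method `chisq.test()` uses internally; the counts are copied from the survival table above, and the names `observed`, `expected`, `x_squared`, `df`, and `p_value` are chosen here for illustration.

```r
# Observed counts, copied from the survival-by-class table
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Survived = c("No", "Yes"))
)

# Expected count for each cell under independence:
# (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: squared deviations scaled by the expected counts
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) x (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution with df degrees of freedom
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 1)
round(x_squared, 2)
df
p_value
```

The statistic rounds to 133.05 with 2 degrees of freedom, matching the `chisq.test()` output. Printing `expected` also makes it easy to check that no expected cell count is small, a common rule of thumb for trusting the chi-squared approximation.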
Conclusion: We reject the null hypothesis of independence. There is a statistically significant association between passenger class and survival on the Titanic. Ticket class was clearly linked to the chance of survival; tragically, passengers in lower classes were far less likely to survive (25.2% in 3rd class versus 62.5% in 1st).