Chi-Squared Test of Independence

1 What is a Chi-Squared Test?

The chi-squared test of independence assesses whether there is a statistically significant association between two categorical variables. It compares the frequencies we observe in our data to the frequencies we would expect if the two variables were completely independent of each other.

If the observed counts differ substantially from the expected counts, we have evidence that the variables are associated.
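
For reference, the comparison described above is usually written as follows. The symbols are notation chosen here, not taken from the original: O_{ij} is the observed count in row i and column j of the contingency table, R_i and C_j are the corresponding row and column totals, N is the grand total, and E_{ij} is the count expected under independence.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

Each cell contributes its squared deviation scaled by the expected count, so large gaps between observed and expected counts push the statistic up.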

2 The Question

We’ll use the famous Titanic dataset to ask: Was survival on the Titanic associated with passenger class?

In other words, were your chances of surviving the disaster related to your ticket class (1st, 2nd, or 3rd)?

3 Setup

library(tidyverse)
library(gt)
library(gtExtras)

4 The Data

The Titanic dataset is built into R. It contains counts of passengers and crew aboard the RMS Titanic, cross-classified by class, sex, age, and survival status. We’ll focus on passengers only.

titanic_df <- Titanic |>
  # convert the built-in table to a tibble for easier manipulation
  as_tibble() |>
  # expand the frequency counts into individual rows (one row per person)
  uncount(n) |>
  # exclude crew members to focus on passengers
  filter(Class != "Crew")

glimpse(titanic_df)
Rows: 1,316
Columns: 4
$ Class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "3rd"…
$ Sex      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male…
$ Age      <chr> "Child", "Child", "Child", "Child", "Child", "Child", "Child"…
$ Survived <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
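
As a quick cross-check on the 1,316 rows above, the same passenger counts can be read straight off the built-in table with base R. This is a minimal sketch; `class_by_survival` and `passenger_counts` are illustrative names, not part of the original code.

```r
# Structure of the built-in 4-way table: Class x Sex x Age x Survived
dim(Titanic)
dimnames(Titanic)

# Collapse over Sex and Age to get Class-by-Survived counts
class_by_survival <- apply(Titanic, c("Class", "Survived"), sum)

# Drop the crew row, mirroring the filter() step
passenger_counts <- class_by_survival[rownames(class_by_survival) != "Crew", ]

passenger_counts
sum(passenger_counts)  # total passengers; should equal nrow(titanic_df)
```

If the total matches the row count reported by `glimpse()`, the expansion with `uncount()` has neither dropped nor duplicated anyone.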

5 Visualising the Question

Before running the test, let’s visualise the relationship between class and survival.

titanic_summary <- titanic_df |>
  # count passengers by class and survival status
  count(Class, Survived) |>
  group_by(Class) |>
  mutate(
    # calculate totals for each class
    total = sum(n),
    prop = n / total,
    # calculate cumulative proportions for label positioning
    cumulative = cumsum(prop),
    label_pos = cumulative - (prop / 2)
  ) |>
  ungroup()

titanic_summary |>
  ggplot(aes(x = prop,
             # reverse factor order so 1st Class appears at the top
             y = fct_rev(Class),
             fill = fct_rev(Survived))) +
  geom_col(width = 0.6) +
  # add count labels inside each bar segment
  geom_text(aes(x = label_pos, label = n),
            colour = "white",
            fontface = "bold",
            size = 5) +
  # add total labels on the right side of bars
  geom_text(aes(x = 1.08,
                label = ifelse(Survived == "Yes",
                               paste("Total:", total),
                               "")),
            hjust = 0,
            colour = "grey40",
            size = 4) +
  scale_fill_manual(
    values = c("Yes" = "#60a5fa", "No" = "#f87171"),
    labels = c("Survived", "Died"),
    breaks = c("Yes", "No")
  ) +
  scale_x_continuous(
    # expand x-axis to make room for total labels
    expand = expansion(mult = c(0, 0.15)),
    labels = NULL
  ) +
  labs(
    title = "Survival on the Titanic by Passenger Class",
    subtitle = "1st class passengers were far more likely to survive",
    x = NULL,
    y = NULL,
    fill = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot"
  )

6 Summary Table

titanic_df |>
  count(Class, Survived) |>
  # pivot data so Died and Survived are separate columns
  pivot_wider(names_from = Survived,
              values_from = n) |>
  mutate(
    Total = No + Yes,
    # calculate survival rate as a proportion
    `Survival Rate` = Yes / Total
  ) |>
  rename(Died = No, Survived = Yes) |>
  gt() |>
  # format survival rate as a percentage
  fmt_percent(`Survival Rate`, decimals = 1) |>
  # apply colour gradient: red (low) → yellow → green (high)
  data_color(
    columns = `Survival Rate`,
    palette = c("#f87171", "#fbbf24", "#4ade80"),
    domain = c(0.2, 0.7)
  ) |>
  # centre-align numeric columns
  cols_align(align = "center",
             columns = -Class) |>
  tab_header(
    title = md("**Titanic Survival by Class**"),
    subtitle = "A clear gradient from 1st Class to 3rd Class"
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.align = "left"
  )

Titanic Survival by Class

A clear gradient from 1st Class to 3rd Class

Class   Died   Survived   Total   Survival Rate
1st      122        203     325           62.5%
2nd      167        118     285           41.4%
3rd      528        178     706           25.2%

The visualisation and table both reveal a clear pattern: survival rates fell steeply with each step down the class hierarchy, from 62.5% in 1st class to 41.4% in 2nd and 25.2% in 3rd.

7 The Chi-Squared Test

titanic_df |>
  # select only the two variables of interest
  select(Class, Survived) |>
  # create a contingency table of counts
  table() |>
  # perform the chi-squared test
  chisq.test()

    Pearson's Chi-squared test

data:  table(select(titanic_df, Class, Survived))
X-squared = 133.05, df = 2, p-value < 2.2e-16
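
The printed summary hides the expected counts the test compares against. As a hedged sketch, the object returned by `chisq.test()` can be stored and inspected; the table is rebuilt here with base R so the snippet runs on its own, and `class_survival` and `test_result` are illustrative names rather than anything from the original code.

```r
# Rebuild the Class-by-Survived passenger counts from the built-in table
class_survival <- apply(Titanic, c("Class", "Survived"), sum)
class_survival <- class_survival[rownames(class_survival) != "Crew", ]

# Store the test result instead of only printing it
test_result <- chisq.test(class_survival)

# Counts we would expect in each cell if class and survival were independent
test_result$expected

# Pearson residuals, (observed - expected) / sqrt(expected):
# positive means more people in that cell than independence predicts
test_result$residuals
```

Looking at `expected` is also a simple way to check the usual rule of thumb that expected cell counts should not be very small, and the residuals show which cells contribute most to the statistic.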

8 Interpreting the Results

The output tells us three key things:

  • X-squared: The test statistic. Larger values indicate greater deviation from what we’d expect if the variables were independent.

  • df (degrees of freedom): Calculated as (rows - 1) × (columns - 1) = (3 - 1) × (2 - 1) = 2.

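  • p-value: Extremely small (< 2.2e-16). This is far below the conventional threshold of 0.05: if class and survival were truly independent, a test statistic this large would almost never arise by chance.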
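
To see where these three numbers come from, the calculation can be reproduced by hand from the counts in the summary table. This is a minimal sketch in base R; the variable names (`observed`, `expected`, `chi_sq`, `deg_free`, `p_value`) are illustrative.

```r
# Observed counts copied from the summary table (rows: class, columns: outcome)
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(Class = c("1st", "2nd", "3rd"),
                  Outcome = c("Died", "Survived"))
)

# Expected count per cell under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic: sum of squared deviations, each scaled by its expected count
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

round(chi_sq, 2)
deg_free
p_value
```

The rounded statistic and the degrees of freedom should agree with the `chisq.test()` output above, and the p-value falls far below the `2.2e-16` floor that R prints.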

Conclusion: We reject the null hypothesis of independence. There is a statistically significant association between passenger class and survival on the Titanic. Your ticket class was strongly associated with your chances of survival: tragically, those in lower classes were significantly less likely to survive.